Abstract

In the supervised learning setting termed Multiple-Instance Learning (MIL), the examples are bags of 
instances, and the bag label is a function of the labels of its instances. Typically, this function is the Boolean 
OR. The learner observes a sample of bags and the bag labels, but not the instance labels that determine the 
bag labels. The learner is then required to emit a classification rule for bags based on the sample. MIL has 
numerous applications, and many heuristic algorithms have been used successfully on this problem, each 
adapted to specific settings or applications. In this work we provide a unified theoretical analysis for MIL, 
which holds for any underlying hypothesis class, regardless of a specific application or problem domain. We 
show that the sample complexity of MIL is only poly-logarithmically dependent on the size of the bag, for 
any underlying hypothesis class. In addition, we introduce a new PAC-learning algorithm for MIL, which 
uses a regular supervised learning algorithm as an oracle. We prove that efficient PAC-learning for MIL can 
be generated from any efficient non-MIL supervised learning algorithm that handles one-sided error. The 
computational complexity of the resulting algorithm is only polynomially dependent on the bag size. 
Keywords: Multiple-instance learning, learning theory, sample complexity, PAC learning, supervised clas- 
sification. 



1. Introduction 

We consider the learning problem termed Multiple-Instance Learning (MIL), first introduced in
Dietterich et al. (1997). MIL is a special type of a supervised classification problem. As in classical
supervised classification, in MIL the learner receives a sample of labeled examples drawn i.i.d. from an
arbitrary and unknown distribution, and its objective is to discover a classification rule with a small
expected error over the same distribution. In MIL additional structure is assumed, whereby the examples are
received as bags of instances, such that each bag is composed of several instances. It is assumed that each
instance has a true label, however the learner only observes the labels of the bags. In classical MIL the
label of a bag is the Boolean OR of the labels of the instances the bag contains. Various generalizations to
MIL have been proposed (see e.g. Raedt, 1998; Weidmann et al., 2003). Here we consider both classical MIL
and the more general setting, where a function other than Boolean OR determines bag labels based on
instance labels.
This function is known to the learner a-priori. We term the more general setting generalized MIL. 

It is possible, in principle, to view MIL as a regular supervised classification task, where a bag is a single 
example, and the instances in a bag are merely part of its internal representation. Such a view, however, 
means that one must analyze each specific MIL problem separately, and that results and methods that apply 
to one MIL problem are not transferable to other MIL problems. We propose instead a generic approach to 
the analysis of MIL, in which the properties of a MIL problem are analyzed as a function of the properties 
of the matching non-MIL problem. As we show in this work, the connections between the MIL and the non- 
MIL properties are strong and useful. The generic approach has the advantage that it automatically extends 
all knowledge and methods that apply to non-MIL problems into knowledge and methods that apply to MIL, 
without requiring specialized analysis for each specific MIL problem. Our results are thus applicable for
diverse hypothesis classes, label relationships between bags and instances, and target losses. Moreover, the 
generic approach allows a better theoretical understanding of the relationship, in general, between regular 
learning and multi-instance learning with the same hypothesis class. 

The generic approach can also be helpful for the design of algorithms, since it allows deriving generic 
methods and approaches that hold across different settings. For instance, as we show below, a generic 
PAC-learning algorithm can be derived for a large class of MIL problems with different hypothesis classes. 
Other applications can be found in follow-up research of the results we report here, such as a generic
bag-construction mechanism (Sabato et al., 2010), and learning when bags have a manifold structure
(Babenko et al., 2011). As generic analysis goes, it might be possible to improve upon it in some specific
cases. Identifying these cases and providing tighter analysis for them is an important topic for future
work. We do show that in some important cases — most notably that of learning separating hyperplanes with 
classical MIL — our analysis is tight up to constants. 

MIL has been used in numerous applications. In Dietterich et al. (1997) the drug design application
motivates this setting. In this application, the goal is to predict which molecules would bind to a specific 
binding site. Each molecule has several possible conformations (shapes) it can take. If at least one of the 
conformations binds to the binding site, then the molecule is labeled positive. However, it is not possible 
to experimentally identify which conformation was the successful one. Thus, a molecule can be thought of 
as a bag of conformations, where each conformation is an instance in the bag representing the molecule. 
This application employs the hypothesis class of Axis Parallel Rectangles (APRs), and has made APRs the 
hypothesis class of choice in several theoretical works that we mention below. There are many other
applications for MIL, including image classification (Maron and Ratan, 1998), web index page
recommendation (Zhou et al., 2005) and text categorization (Andrews, 2007).

Previous theoretical analysis of the computational aspects of MIL has been done in two main settings. In
the first setting, analyzed for instance in Auer et al. (1998); Blum and Kalai (1998); Long and Tan (1998),
it is assumed that all the instances are drawn i.i.d from a single distribution over instances, so that the 
instances in each bag are statistically independent. Under this independence assumption, learning from an 
i.i.d. sample of bags is as easy as learning from an i.i.d. sample of instances with one-sided label noise. 
This is stated in the following theorem. 



Theorem 1 (Blum and Kalai, 1998) If a hypothesis class is PAC-learnable in polynomial time from
one-sided random classification noise, then the same hypothesis class is PAC-learnable in polynomial time in
MIL under the independence assumption. The computational complexity of learning is polynomial in the 
bag size and in the sample size. 

The assumption of statistical independence of the instances in each bag is, however, very limiting, as it is 
irrelevant to many applications. 

In the second setting one assumes that bags are drawn from an arbitrary distribution over bags, so that 
the instances within a bag may be statistically dependent. This is clearly much more useful in practice, 
since bags usually describe a complex object with internal structure, thus it is implausible to assume even 
approximate independence of instances in a bag. For the hypothesis class of APRs and an arbitrary
distribution over bags, it is shown in Auer et al. (1998) that if there exists a PAC-learning algorithm for MIL
with APRs, and this algorithm is polynomial in both the size of the bag and the dimension of the Euclidean
space, then it is possible to polynomially PAC-learn DNF formulas, a problem which is solvable only if
RP = NP (Pitt and Valiant, 1986). In addition, if it is possible to improperly learn MIL with APRs (that
is, to learn a classifier which is not itself an APR), then it is possible to improperly learn DNF formulas, 
a problem which has not been solved to this date for general distributions. This result implies that it is 
not possible to PAC-learn MIL on APRs using an algorithm which is efficient in both the bag size and the 
problem's dimensionality. It does not, however, preclude the possibility of performing MIL efficiently in 
other cases. 

In practice, numerous algorithms have been proposed for MIL, each focusing on a different specialization
of this problem. Almost none of these algorithms assume statistical independence of instances in a
bag. Moreover, some of the algorithms explicitly exploit presumed dependences between instances in a bag.
Dietterich et al. (1997) propose several heuristic algorithms for finding an APR that predicts the label of an
instance and of a bag. Diverse Density (Maron and Lozano-Pérez, 1998) and EM-DD (Zhang and Goldman,
2001) employ assumptions on the structure of the bags of instances. DPBoost (Andrews and Hofmann,
2003), mi-SVM and MI-SVM (Andrews et al., 2002), and Multi-Instance Kernels (Gärtner et al., 2002) are
approaches for learning MIL using margin-based objectives. Some of these methods work quite well in 
practice. However, no generalization guarantees have been provided for any of them. 

In this work we analyze MIL and generalized MIL in a general framework, independent of a specific 
application, and provide results that hold for any underlying hypothesis class. We assume a fixed hypothesis 
class defined over instances. We then investigate the relationship between learning with respect to this 
hypothesis class in the classical supervised learning setting with no bags, and learning with respect to the 
same hypothesis class in MIL. We address both sample complexity and computational feasibility. 

Our sample complexity analysis shows that for binary hypotheses and thresholded real-valued hypotheses,
the distribution-free sample complexity for generalized MIL grows only logarithmically with the maximal
bag size. We also provide poly-logarithmic sample complexity bounds for the case of margin learning.
We further provide distribution-dependent sample complexity bounds for more general loss functions.
These bounds are useful when only the average bag size is bounded. The results imply generalization bounds
for previously proposed algorithms for MIL. Addressing the computational feasibility of MIL, we provide 
a new learning algorithm with provable guarantees for a class of bag-labeling functions that includes the 
Boolean OR, used in classical MIL, as a special case. Given a non-MIL learning algorithm for the desired 
hypothesis class, which can handle one-sided errors, we improperly learn MIL with the same hypothesis 
class. The construction is simple to implement, and provides a computationally efficient PAC-learning of 
MIL, with only a polynomial dependence of the run time on the bag size. 

In this work we consider the problem of learning to classify bags using a labeled sample of bags. We 
do not attempt to learn to classify single instances using a labeled sample of bags. We point out that it 
is not generally possible to find a low-error classification rule for instances based on a bag sample. As a 
simple counterexample, assume that the label of a bag is the Boolean OR of the labels of its instances, and
that every bag includes both a positive instance and a negative instance. In this case all bags are labeled as 
positive, and it is not possible to distinguish the two types of instances by observing only bag labels. 

The structure of the paper is as follows. In Section 2 the problem is formally defined and notation is
introduced. In Section 3 the sample complexity of generalized MIL for binary hypotheses is analyzed. We
provide a useful lemma bounding covering numbers for MIL in Section 4. In Section 5 we analyze the
sample complexity of generalized MIL with real-valued functions for large-margin learning.
Distribution-dependent results for binary learning and real-valued learning based on the average bag size
are presented in Section 6. In Section 7 we present a PAC-learner for MIL and analyze its properties. We
conclude in Section 8. The appendix includes technical proofs that have been omitted from the text. A
preliminary version of this work has been published as Sabato and Tishby (2009).

2. Notations and Definitions 

For a natural number $k$, we denote $[k] \triangleq \{1, \ldots, k\}$. For a real number $x$, we denote
$[x]_+ = \max\{0, x\}$. $\log$ denotes a base 2 logarithm. For two vectors $x, y \in \mathbb{R}^n$,
$\langle x, y \rangle$ denotes the inner product of $x$ and $y$. We use the function
$\mathrm{sign} : \mathbb{R} \to \{-1,+1\}$ where $\mathrm{sign}(x) = 1$ if $x \geq 0$ and $\mathrm{sign}(x) = -1$
otherwise. For a function $f : A \to B$, we denote by $f|_C$ its restriction to a set $C \subseteq A$. For a
univariate function $f$, denote its first and second derivatives by $f'$ and $f''$ respectively.

Let $\mathcal{X}$ be the input space, also called the domain of instances. A bag is a finite ordered set of
instances from $\mathcal{X}$. Denote the set of allowed sizes for bags in a specific MIL problem by
$R \subseteq \mathbb{N}$. For any set $A$ we denote $A^{(R)} \triangleq \bigcup_{n \in R} A^n$. Thus the domain
of bags with a size in $R$ and instances from $\mathcal{X}$ is $\mathcal{X}^{(R)}$. A bag of size $n$ is
denoted by $\bar{\mathbf{x}} = (x[1], \ldots, x[n])$ where each $x[j] \in \mathcal{X}$ is an instance in the
bag. We denote the number of instances in $\bar{\mathbf{x}}$ by $|\bar{\mathbf{x}}|$. For any univariate
function $f : A \to B$, we may also use its extension to a multivariate function from sequences of elements
in $A$ to sequences of elements in $B$, defined by $f(a[1], \ldots, a[k]) = (f(a[1]), \ldots, f(a[k]))$.

Let $I \subseteq \mathbb{R}$ be an allowed range for hypotheses over instances or bags. For instance,
$I = \{-1,+1\}$ for binary hypotheses and $I = [-B, B]$ for real-valued hypotheses with a bounded range.
$\mathcal{H} \subseteq I^{\mathcal{X}}$ is a hypothesis class for instances. Every MIL problem is defined by a
fixed bag-labeling function $\psi : I^{(R)} \to I$ that determines the bag labels given the instance labels.
Formally, every instance hypothesis $h : \mathcal{X} \to I$ defines a bag hypothesis, denoted by
$\bar{h} : \mathcal{X}^{(R)} \to I$ and defined by

$$\forall \bar{\mathbf{x}} \in \mathcal{X}^{(R)}, \quad \bar{h}(\bar{\mathbf{x}}) \triangleq \psi(h(x[1]), \ldots, h(x[r])).$$

The hypothesis class for bags given $\mathcal{H}$ and $\psi$ is denoted
$\bar{\mathcal{H}} \triangleq \{\bar{h} \mid h \in \mathcal{H}\}$. Importantly, the identity of $\psi$
is known to the learner a-priori, thus each $\psi$ defines a different generalized MIL problem. For instance,
in classical MIL, $I = \{-1,+1\}$ and $\psi$ is the Boolean OR.
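
As a small illustration of how an instance hypothesis induces a bag hypothesis, the following Python sketch lifts a hypothesis through a bag-labeling function. The threshold hypothesis and all function names are assumptions made for this example and are not taken from the paper.

```python
from typing import Callable, Sequence

Label = int  # labels are +1 or -1


def boolean_or(labels: Sequence[Label]) -> Label:
    """Boolean OR over {-1, +1}: positive iff some instance label is positive."""
    return 1 if any(v == 1 for v in labels) else -1


def lift_to_bags(h: Callable[[float], Label],
                 psi: Callable[[Sequence[Label]], Label]) -> Callable[[Sequence[float]], Label]:
    """Map an instance hypothesis h to the bag hypothesis psi(h(x[1]), ..., h(x[n]))."""
    def bag_hypothesis(bag: Sequence[float]) -> Label:
        return psi([h(x) for x in bag])
    return bag_hypothesis


# Hypothetical instance hypothesis: a threshold at 0.5 on the real line.
def h(x: float) -> Label:
    return 1 if x >= 0.5 else -1


h_bar = lift_to_bags(h, boolean_or)
print(h_bar([0.1, 0.2, 0.9]))  # one positive instance, so the bag is labeled +1
print(h_bar([0.1, 0.2, 0.3]))  # no positive instance, so the bag is labeled -1
```

Swapping `boolean_or` for another function of the instance labels gives a different generalized MIL problem over the same instance hypothesis class.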

We assume the labeled bags are drawn from a fixed distribution $D$ over $\mathcal{X}^{(R)} \times \{-1,+1\}$,
where each pair drawn from $D$ constitutes a bag and its binary label. Given a range $I \subseteq \mathbb{R}$
of possible label predictions, we define a loss function $\ell : \{-1,+1\} \times I \to \mathbb{R}$, where
$\ell(y, \hat{y})$ is the loss incurred if the true label is $y$ and the predicted label is $\hat{y}$. The true
loss of a bag-classifier $h : \mathcal{X}^{(R)} \to I$ is denoted by
$\ell(h, D) \triangleq \mathbb{E}_{(\bar{X},Y) \sim D}[\ell(Y, h(\bar{X}))]$. We say that a sample or a
distribution are realizable by $\bar{\mathcal{H}}$ if there is a hypothesis $h \in \bar{\mathcal{H}}$ that
classifies them with zero loss.

The MIL learner receives a labeled sample of bags
$\{(\bar{\mathbf{x}}_1, y_1), \ldots, (\bar{\mathbf{x}}_m, y_m)\} \subseteq \mathcal{X}^{(R)} \times \{-1,+1\}$
drawn from $D^m$, and returns a classifier $\hat{h} : \mathcal{X}^{(R)} \to I$. The goal of the learner is to
return $\hat{h}$ that has a low loss $\ell(\hat{h}, D)$ compared to the minimal loss that can be achieved with
the bag hypothesis class, denoted by
$\ell^*(\bar{\mathcal{H}}, D) \triangleq \inf_{h \in \bar{\mathcal{H}}} \ell(h, D)$. The empirical loss of a
classifier for bags on a labeled sample $S$ is $\ell(h, S) \triangleq \mathbb{E}_{(X,Y) \sim S}[\ell(Y, h(X))]$.
For an unlabeled set of bags $S = \{\bar{\mathbf{x}}_i\}_{i \in [m]}$, we denote the set of instances in the
bags of $S$ by $S^{\cup} \triangleq \{x_i[j] \mid i \in [m], j \in [|\bar{\mathbf{x}}_i|]\}$.

Classes of Real-Valued bag-functions

In classical MIL the bag function is the Boolean OR over binary labels, that is $I = \{-1,+1\}$ and
$\psi = \mathrm{OR} : \{-1,+1\}^{(R)} \to \{-1,+1\}$. A natural extension of the Boolean OR to a function over reals is the
max function. We further consider two classes of bag functions over reals, each representing a different 
generalization of the max function, which conserves a different subset of its properties. 

The first class we consider is the class of bag-functions that extend monotone Boolean functions. Mono- 
tone Boolean functions map Boolean vectors to {—1, +1}, such that the map is monotone-increasing in each 
of the inputs. The set of monotone Boolean functions is exactly the set of functions that can be represented 
by some composition of AND and OR functions, thus it includes the Boolean OR. The natural extension of 
monotone Boolean functions to real functions over real vectors is achieved by replacing OR with max and 
AND with min. Formally, we define extensions of monotone Boolean functions as follows. 

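Definition 2 A function from $\mathbb{R}^n$ into $\mathbb{R}$ is an extension of an n-ary monotone Boolean
function if it belongs to the set $\mathcal{M}_n$ defined inductively as follows, where the input to a
function is $z \in \mathbb{R}^n$:

(1) $\forall j \in [n],\ z \mapsto z[j] \in \mathcal{M}_n$;

(2) $\forall k \in \mathbb{N}^+,\ f_1, \ldots, f_k \in \mathcal{M}_n \Rightarrow z \mapsto \max_{j \in [k]}\{f_j(z)\} \in \mathcal{M}_n$;

(3) $\forall k \in \mathbb{N}^+,\ f_1, \ldots, f_k \in \mathcal{M}_n \Rightarrow z \mapsto \min_{j \in [k]}\{f_j(z)\} \in \mathcal{M}_n$.

We say that a bag-function $\psi : \mathbb{R}^{(R)} \to \mathbb{R}$ extends monotone Boolean functions if for
all $n \in R$, $\psi|_{\mathbb{R}^n} \in \mathcal{M}_n$.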

The class of extensions to Boolean functions thus generalizes the max function in a natural way.

The second class of bag functions we consider generalizes the max function by noting that for bounded
inputs, the max function can be seen as a variant of the infinity-norm
$\|z\|_\infty = \max_i |z[i]|$. Another natural bag-function over reals is the average function, defined as
$\psi(z) = \frac{1}{n}\sum_{i \in [n]} z[i]$, which can be seen as a variant of the 1-norm
$\|z\|_1 = \sum_{i \in [n]} |z[i]|$. More generally, we treat the case where the hypotheses map
into $I = [-1, 1]$, and consider the class of bag functions inspired by a p-norm, defined as follows.

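Definition 3 For $p \in [1, \infty)$, the p-norm bag function $\psi_p : [-1,+1]^{(R)} \to [-1,+1]$ is defined by:

$$\forall z \in [-1,+1]^n, \quad \psi_p(z) \triangleq \left(\frac{1}{n}\sum_{i=1}^{n} (z[i]+1)^p\right)^{1/p} - 1.$$

For $p = \infty$, define $\psi_\infty \equiv \lim_{p \to \infty} \psi_p$.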

Since the inputs of $\psi_p$ are in $[-1,+1]$, we have $\psi_p(z) = n^{-1/p} \cdot \|z + 1\|_p - 1$ where
$n$ is the length of $z$. Note that the average function is simply $\psi_1$, and
$\psi_\infty = \|z + 1\|_\infty - 1 \equiv \max$. Other values of $p$ fall between these two extremes: Due to
the p-norm inequality, which states that for all $p \in [1, \infty)$ and $x \in \mathbb{R}^n$,
$\frac{1}{n}\|x\|_1 \leq n^{-1/p}\|x\|_p \leq \|x\|_\infty$, we have that for all $z \in [-1,+1]^n$

$$\mathrm{average} \equiv \psi_1(z) \leq \psi_p(z) \leq \psi_\infty(z) \equiv \max. \qquad (1)$$
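
The ordering in Eq. (1) can be checked numerically. The sketch below implements the p-norm bag function as $n^{-1/p}\|z+1\|_p - 1$; the input vector and the chosen values of $p$ are arbitrary assumptions made for illustration.

```python
import math


def psi_p(z, p):
    """p-norm bag function on inputs in [-1, +1]: n^(-1/p) * ||z + 1||_p - 1."""
    n = len(z)
    if math.isinf(p):
        return max(z)  # limit p -> infinity: ||z + 1||_inf - 1 = max(z)
    return (sum((zi + 1.0) ** p for zi in z) / n) ** (1.0 / p) - 1.0


z = [-0.8, 0.1, 0.6, -0.2]          # arbitrary instance predictions in [-1, +1]
average = sum(z) / len(z)
ps = (1, 2, 4, 16, math.inf)
values = [psi_p(z, p) for p in ps]

for p, v in zip(ps, values):
    print(f"p = {p}: psi_p(z) = {v:.4f}")
```

For this input the printed values increase from the average (p = 1) to the maximum (p = infinity), in line with Eq. (1).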

Many of our results hold when the scale of the output of the bag-function is related to the scale of its 
inputs. Formally, we consider cases where the output of the bag-function does not change by much unless 
its inputs change by much. This is formalized in the following definition of a Lipschitz bag function. 

Definition 4 A bag function $\psi : \mathbb{R}^{(R)} \to \mathbb{R}$ is c-Lipschitz with respect to the
infinity norm for $c > 0$ if

$$\forall n \in R,\ \forall a, b \in \mathbb{R}^n, \quad |\psi(a) - \psi(b)| \leq c\|a - b\|_\infty.$$

The average bag-function and the max bag-function are 1-Lipschitz. Moreover, all extensions of monotone
Boolean functions are 1-Lipschitz with respect to the infinity norm; this is easy to verify by induction on
Def. 2. All p-norm bag functions are also 1-Lipschitz, as the following derivation shows:

$$|\psi_p(a) - \psi_p(b)| = n^{-1/p} \cdot \big|\, \|a + 1\|_p - \|b + 1\|_p \big| \leq n^{-1/p} \cdot \|a - b\|_p \leq \|a - b\|_\infty.$$

Thus, our results for Lipschitz bag-functions hold in particular for the two bag-function classes we have
defined here, and specifically for the max function.

3. Binary MIL 

In this section we consider binary MIL. In binary MIL we let $I = \{-1,+1\}$, thus we have a binary instance
hypothesis class $\mathcal{H} \subseteq \{-1,+1\}^{\mathcal{X}}$. We further let our loss be the zero-one loss,
defined by $\ell_{0/1}(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}]$. The distribution-free sample complexity of
learning relative to a binary hypothesis class with the zero-one loss is governed by the VC-dimension of the
hypothesis class (Vapnik and Chervonenkis, 1971). Thus we bound the VC-dimension of $\bar{\mathcal{H}}$ as a
function of the maximal possible bag size $r = \max R$, and of the VC-dimension of $\mathcal{H}$. We show that
the VC-dimension of $\bar{\mathcal{H}}$ is at most logarithmic in $r$, and at most linear in the VC-dimension
of $\mathcal{H}$, for any bag-labeling function $\psi : \{-1,+1\}^{(R)} \to \{-1,+1\}$. It follows that the
sample complexity of MIL grows only logarithmically with the size of the bag. Thus MIL is feasible even
for quite large bags. In fact, based on the results we show henceforth, Sabato et al. (2010) have shown that
MIL can sometimes be used to accelerate even single-instance learning. We further provide lower bounds
that show that the dependence of the upper bound on $r$ and on the VC-dimension of $\mathcal{H}$ is
unavoidable, for a large class of Boolean bag-labeling functions. We also show a matching lower bound for
the VC-dimension of classical MIL with separating hyperplanes.

3.1 VC-Dimension Upper Bound

Our first theorem establishes a VC-Dimension upper bound for generalized MIL. To prove the theorem we
require the following useful lemma. 

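Lemma 5 For any $R \subseteq \mathbb{N}$ and any bag function $\psi : \{-1,+1\}^{(R)} \to \{-1,+1\}$, and for
any hypothesis class $\mathcal{H} \subseteq \{-1,+1\}^{\mathcal{X}}$ and a finite set of bags
$S \subseteq \mathcal{X}^{(R)}$,

$$\big|\bar{\mathcal{H}}|_S\big| \leq \big|\mathcal{H}|_{S^\cup}\big|.$$

Proof Let $h_1, h_2 \in \bar{\mathcal{H}}$ be bag hypotheses. There exist instance hypotheses
$g_1, g_2 \in \mathcal{H}$ such that $\bar{g}_i = h_i$ for $i = 1, 2$. Assume that $h_1|_S \neq h_2|_S$. We
show that $g_1|_{S^\cup} \neq g_2|_{S^\cup}$, thus proving the lemma.

From the assumption it follows that $\bar{g}_1|_S \neq \bar{g}_2|_S$. Thus there exists at least one bag
$\bar{\mathbf{x}} \in S$ such that $\bar{g}_1(\bar{\mathbf{x}}) \neq \bar{g}_2(\bar{\mathbf{x}})$. Denote its
size by $n$. We have $\psi(g_1(x[1]), \ldots, g_1(x[n])) \neq \psi(g_2(x[1]), \ldots, g_2(x[n]))$.
Hence there exists a $j \in [n]$ such that $g_1(x[j]) \neq g_2(x[j])$. By the definition of $S^\cup$,
$x[j] \in S^\cup$. Therefore $g_1|_{S^\cup} \neq g_2|_{S^\cup}$. ■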
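
The inequality of Lemma 5 can be checked by brute force on a toy problem. The sketch below is only an illustration: the instance domain, the class of threshold hypotheses and the bags are assumptions chosen for the example.

```python
# Hypothetical instance class on the domain {0, 1, 2, 3}:
# threshold hypotheses h_t(x) = +1 iff x >= t, for t = 0, ..., 4.
H = [lambda x, t=t: 1 if x >= t else -1 for t in range(5)]

# A small set of bags S and the union of their instances.
S = [(0, 3), (1, 2), (2,)]
S_union = sorted({x for bag in S for x in bag})


def boolean_or(labels):
    """Boolean OR over {-1, +1}."""
    return 1 if 1 in labels else -1


# Distinct labelings induced on the bags by the bag hypotheses ...
bag_patterns = {tuple(boolean_or([h(x) for x in bag]) for bag in S) for h in H}
# ... and distinct labelings induced on the instances by the instance hypotheses.
instance_patterns = {tuple(h(x) for x in S_union) for h in H}

print(len(bag_patterns), "<=", len(instance_patterns))
```

Two bag hypotheses can only differ on a bag if the underlying instance hypotheses differ on one of its instances, which is why the first count never exceeds the second.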



Theorem 6 Assume that $\mathcal{H}$ is a hypothesis class with a finite VC-dimension $d$. Let
$r \in \mathbb{N}$ and assume that $R \subseteq [r]$. Let the bag-labeling function
$\psi : \{-1,+1\}^{(R)} \to \{-1,+1\}$ be some Boolean function. Denote the VC-dimension of
$\bar{\mathcal{H}}$ by $d_r$. We have

$$d_r \leq \max\{16,\ 2d\log(2er)\}.$$

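Proof For a set of hypotheses $\mathcal{J}$, denote by $\mathcal{J}|_A$ the restriction of each of its
members to $A$, so that $\mathcal{J}|_A \triangleq \{h|_A \mid h \in \mathcal{J}\}$. Since $d_r$ is the
VC-dimension of $\bar{\mathcal{H}}$, there exists a set of bags $S \subseteq \mathcal{X}^{(R)}$ of size $d_r$
that is shattered by $\bar{\mathcal{H}}$, so that $|\bar{\mathcal{H}}|_S| = 2^{d_r}$. By Lemma 5,
$|\bar{\mathcal{H}}|_S| \leq |\mathcal{H}|_{S^\cup}|$, therefore $2^{d_r} \leq |\mathcal{H}|_{S^\cup}|$. In
addition, $R \subseteq [r]$ implies $|S^\cup| \leq r d_r$. By applying Sauer's lemma (Sauer, 1972; Vapnik and
Chervonenkis, 1971) to $\mathcal{H}$ we get

$$2^{d_r} \leq \big|\mathcal{H}|_{S^\cup}\big| \leq \left(\frac{e|S^\cup|}{d}\right)^{d} \leq \left(\frac{e r d_r}{d}\right)^{d},$$

where $e$ is the base of the natural logarithm. It follows that
$d_r \leq d(\log(er) - \log d) + d\log d_r$. To provide an explicit bound for $d_r$, we bound $d\log d_r$ by
dividing to cases:

1. Either $d\log d_r \leq \frac{1}{2}d_r$, thus $d_r \leq 2d(\log(er) - \log d) \leq 2d\log(er)$,

2. or $\frac{1}{2}d_r < d\log d_r$. In this case,

(a) either $d_r \leq 16$,

(b) or $d_r > 16$. In this case $\sqrt{d_r} \leq d_r/\log d_r < 2d$, thus
$d\log d_r = 2d\log\sqrt{d_r} \leq 2d\log 2d$. Substituting in the implicit bound we get
$d_r \leq d(\log(er) - \log d) + 2d\log 2d \leq 2d\log(2er)$.

Combining the cases we have $d_r \leq \max\{16,\ 2d\log(2er)\}$. ■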
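
To get a feel for how slowly the bound of Theorem 6 grows with the bag size, the following sketch evaluates $\max\{16, 2d\log(2er)\}$ for a few values of $r$, with the base 2 logarithm of Section 2. The function name and the sample values of $d$ and $r$ are assumptions made for this illustration.

```python
import math


def mil_vc_upper_bound(d: int, r: int) -> float:
    """Upper bound max{16, 2 d log2(2 e r)} on the VC-dimension of the bag class."""
    return max(16.0, 2 * d * math.log2(2 * math.e * r))


d = 10  # VC-dimension of the instance hypothesis class (example value)
for r in (1, 10, 100, 1000, 10**6):
    print(f"r = {r:>7}: bound = {mil_vc_upper_bound(d, r):.1f}")
```

Increasing the bag size from a thousand to a million instances less than doubles the bound, which reflects the logarithmic dependence on $r$.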



3.2 VC-Dimension Lower Bounds 

In this section we show lower bounds for the VC-dimension of binary MIL, indicating that the dependence
on $d$ and $r$ in Theorem 6 is tight in two important settings.

We say that a bag-function $\psi : \{-1,+1\}^{(R)} \to \{-1,+1\}$ is r-sensitive if there exist a number
$n \in R$ and a vector $c \in \{-1,+1\}^n$ such that for at least $r$ different numbers
$j_1, \ldots, j_r \in [n]$,
$\psi(c[1], \ldots, c[j_i], \ldots, c[n]) \neq \psi(c[1], \ldots, -c[j_i], \ldots, c[n])$. Many commonly used
Boolean functions, such as OR, AND, Parity, and all their variants that stem from negating some of the
inputs, are r-sensitive for every $r \in R$. Our first lower bound shows that if $\psi$ is r-sensitive, the
bound in Theorem 6 cannot be improved without restricting the set of considered instance hypothesis classes.

Theorem 7 Assume that the bag function $\psi : \{-1,+1\}^{(R)} \to \{-1,+1\}$ is r-sensitive for some
$r \in \mathbb{N}$. For any natural $d$ and any instance domain $\mathcal{X}$ with
$|\mathcal{X}| \geq rd\lfloor\log(r)\rfloor$, there exists a hypothesis class $\mathcal{H}$ with a
VC-dimension at most $d$, such that the VC dimension of $\bar{\mathcal{H}}$ is at least
$d\lfloor\log(r)\rfloor$.

Proof Since $\psi$ is r-sensitive, there are a vector $c \in \{-1,+1\}^n$ and a set $J \subseteq [n]$ such
that $|J| = r$ and $\forall j \in J$, $\psi(c[1], \ldots, c[n]) \neq \psi(c[1], \ldots, -c[j], \ldots, c[n])$.
Since $\psi$ maps all inputs to $\{-1,+1\}$, it follows that
$\forall j \in J$, $\psi(c[1], \ldots, -c[j], \ldots, c[n]) = -\psi(c[1], \ldots, c[n])$. Denote
$a = \psi(c[1], \ldots, c[n])$. Then we have

$$\forall j \in J,\ y \in \{-1,+1\}, \quad \psi(c[1], \ldots, c[j] \cdot y, \ldots, c[n]) = a \cdot y. \qquad (2)$$

For simplicity of notation, we henceforth assume w.l.o.g. that $n = r$ and $J = [r]$.

Let $S \subseteq \mathcal{X}^r$ be a set of $d\lfloor\log(r)\rfloor$ bags of size $r$, such that all the
instances in all the bags are distinct elements of $\mathcal{X}$. Divide $S$ into $d$ mutually exclusive
subsets, each with $\lfloor\log(r)\rfloor$ bags. Denote bag $p$ in subset $t$ by $\bar{\mathbf{x}}_{(p,t)}$.
We define the hypothesis class

$$\mathcal{H} \triangleq \{h[k_1, \ldots, k_d] \mid \forall i \in [d],\ k_i \in [2^{\lfloor\log(r)\rfloor}]\},$$

where $h[k_1, \ldots, k_d]$ is defined as follows (see illustration in Table 1): For $x \in \mathcal{X}$ which
is not an instance of any bag in $S$, $h[k_1, \ldots, k_d](x) = -1$. For $x = x_{(p,t)}[j]$, let $b_{(p,n)}$ be
bit $p$ in the binary representation of the number $n$, and define

$$h[k_1, \ldots, k_d](x_{(p,t)}[j]) = \begin{cases} c[j] \cdot a(2b_{(p,j-1)} - 1) & j = k_t, \\ c[j] & j \neq k_t. \end{cases}$$

 t   p   Instance label h(x_(p,t)[j]), j = 1, ..., 8   Bag label
 1   1   -  -  -  +  -  -  -  -                            +
 1   2   -  -  -  +  -  -  -  -                            +
 1   3   -  -  -  -  -  -  -  -                            -
 2   1   -  -  -  -  -  -  -  +                            +
 2   2   -  -  -  -  -  -  -  +                            +
 2   3   -  -  -  -  -  -  -  +                            +
 3   1   -  -  -  -  -  -  -  -                            -
 3   2   -  -  +  -  -  -  -  -                            +
 3   3   -  -  -  -  -  -  -  -                            -

Table 1: An example of the hypothesis $h = h[4, 8, 3]$, with $\psi = \mathrm{OR}$ (so that $c$ is the all $-1$
vector), and $d = 3$. Each line represents a bag in $S$, each column represents an instance in the bag.



We now show that $S$ is shattered by $\bar{\mathcal{H}}$, indicating that the VC-dimension of
$\bar{\mathcal{H}}$ is at least $|S| = d\lfloor\log(r)\rfloor$. To complete the proof, we further show that
the VC-dimension of $\mathcal{H}$ is no more than $d$.

$S$ is shattered by $\bar{\mathcal{H}}$: Let $\{y_{(p,t)}\}_{p \in [\lfloor\log(r)\rfloor],\, t \in [d]}$ be
some labeling over $\{-1,+1\}$ for the bags in $S$. For each $t \in [d]$ let

$$k_t \triangleq 1 + \sum_{p=1}^{\lfloor\log(r)\rfloor} \frac{y_{(p,t)} + 1}{2} \cdot 2^{p-1}.$$

Then by Eq. (2), for all $p \in [\lfloor\log(r)\rfloor]$ and $t \in [d]$,

$$\bar{h}[k_1, \ldots, k_d](\bar{\mathbf{x}}_{(p,t)}) = \psi(c[1], \ldots, c[k_t] \cdot a(2b_{(p,k_t-1)} - 1), \ldots, c[r]) = a^2(2b_{(p,k_t-1)} - 1) = 2b_{(p,k_t-1)} - 1 = y_{(p,t)}.$$

Thus $\bar{h}[k_1, \ldots, k_d]$ labels $S$ according to $\{y_{(p,t)}\}$.



The VC-dimension of $\mathcal{H}$ is no more than $d$: Let $A \subseteq \mathcal{X}$ be of size $d + 1$. If
there is an element in $A$ which is not an instance in $S$ then this element is labeled $-1$ by all
$h \in \mathcal{H}$, therefore $A$ is not shattered. Otherwise, all elements in $A$ are instances in bags in
$S$. Since there are $d$ subsets of $S$, there exist two elements in $A$ which are instances of bags in the
same subset $t$. Denote these instances by $x_{(p_1,t)}[j_1]$ and $x_{(p_2,t)}[j_2]$.
Consider all the possible labelings of the two elements by hypotheses in $\mathcal{H}$. If $A$ is shattered,
there must be four possible labelings for these elements. However, by the definition of
$h[k_1, \ldots, k_d]$ it is easy to see that if $j_1 = j_2$ then there are at most two possible labelings by
hypotheses in $\mathcal{H}$, and if $j_1 \neq j_2$ then there are at most three possible labelings. Thus $A$
is not shattered by $\mathcal{H}$, hence the VC-dimension of $\mathcal{H}$ is no more than $d$. ■
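
The construction in the proof of Theorem 7 can be verified by enumeration for small parameters. The sketch below assumes $\psi = \mathrm{OR}$, so that $c$ is the all $-1$ vector and $a = -1$; the function names and the chosen values of $d$ and $r$ are assumptions made for illustration.

```python
from itertools import product
from math import floor, log2


def bit(p: int, m: int) -> int:
    """Bit p (p = 1 is the least significant bit) of the non-negative integer m."""
    return (m >> (p - 1)) & 1


def bag_labels(ks, d: int, r: int):
    """Bag labels assigned by h[k_1, ..., k_d] to all bags, ordered by subset t, then bag p."""
    P = floor(log2(r))
    c = [-1] * r            # all -1 vector, on which OR is r-sensitive
    a = -1                  # a = OR(c)
    labels = []
    for t in range(1, d + 1):
        for p in range(1, P + 1):
            instance_labels = [
                c[j - 1] * a * (2 * bit(p, j - 1) - 1) if j == ks[t - 1] else c[j - 1]
                for j in range(1, r + 1)
            ]
            labels.append(1 if 1 in instance_labels else -1)   # Boolean OR
    return tuple(labels)


d, r = 2, 4
P = floor(log2(r))
realized = {bag_labels(ks, d, r) for ks in product(range(1, 2 ** P + 1), repeat=d)}
print(len(realized), "labelings realized out of", 2 ** (d * P))
```

With $d = 2$ and $r = 4$ there are $d\lfloor\log(r)\rfloor = 4$ bags, and the 16 hypotheses realize all 16 labelings, so the set of bags is shattered.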

Theorem 10 below provides a lower bound for the VC-dimension of MIL for the important case where
the bag-function is the Boolean OR and the hypothesis class is the class of separating hyperplanes in
$\mathbb{R}^n$. For $w \in \mathbb{R}^n$, the function $h_w : \mathbb{R}^n \to \{-1,+1\}$ is defined by
$h_w(x) = \mathrm{sign}(\langle w, x \rangle)$. The hypothesis class of linear classifiers is
$\mathcal{W}_n \triangleq \{h_w \mid w \in \mathbb{R}^n\}$. Let $r \in \mathbb{N}$. We denote the VC-dimension
of $\bar{\mathcal{W}}_n$ for $R = \{r\}$ and $\psi = \mathrm{OR}$ by $d_{r,n}$. We prove a lower bound for
$d_{r,n}$ using two lemmas: Lemma 8 provides a lower bound for $d_{r,3}$, and Lemma 9 links $d_{r,n}$ for
small $n$ with $d_{r,n}$ for large $n$. The resulting general lower bound, which holds for $r = \max R$, is
then stated in Theorem 10.

Lemma 8 Let $d_{r,n}$ be the VC-dimension of $\bar{\mathcal{W}}_n$ as defined above. Then
$d_{r,3} \geq \lfloor\log(2r)\rfloor$.

Proof Denote $L \triangleq \lfloor\log(2r)\rfloor$. We will construct a set $S$ of $L$ bags of size $r$ that
is shattered by $\bar{\mathcal{W}}_3$. The construction is illustrated in Figure 1.




Figure 1: An illustration of the constructed shattered set, with $r = 4$ and $L = \log 4 + 1 = 3$. Each
dot corresponds to an instance. The numbers next to the instances denote the bag to which an
instance belongs, and match the sequence $\mathbf{n}$ defined in the proof. In this illustration bags 1 and
3 are labeled as positive by the bag-hypothesis represented by the solid line.

Let $\mathbf{n} = (n_1, \ldots, n_K)$ be a sequence of indices from $[L]$, created by concatenating all the
subsets of $[L]$ in some arbitrary order, so that $K = L2^{L-1}$, and every index appears $2^{L-1} \leq r$
times in $\mathbf{n}$. Define a set $A = \{a_k \mid k \in [K]\} \subseteq \mathbb{R}^3$ where
$a_k \triangleq (\cos(2\pi k/K), \sin(2\pi k/K), 1) \in \mathbb{R}^3$, so that $a_1, \ldots, a_K$ are
equidistant on a unit circle on a plane embedded in $\mathbb{R}^3$. Define the set of bags
$S = \{\bar{\mathbf{x}}_1, \ldots, \bar{\mathbf{x}}_L\}$ such that
$\bar{\mathbf{x}}_i = (x_i[1], \ldots, x_i[r])$ where $\{x_i[j] \mid j \in [r]\} = \{a_k \mid n_k = i\}$.

We now show that $S$ is shattered by $\bar{\mathcal{W}}_3$: Let $(y_1, \ldots, y_L)$ be some binary labeling
of $L$ bags, and let $Y = \{i \mid y_i = +1\}$. By the definition of $\mathbf{n}$, there exist $j_1, j_2$ such
that $Y = \{n_k \mid j_1 \leq k \leq j_2\}$. Clearly, there exists a hyperplane $w \in \mathbb{R}^3$ that
separates the vectors $\{a_k \mid j_1 \leq k \leq j_2\}$ from the rest of the vectors in $A$. Thus
$\mathrm{sign}(\langle w, a_k \rangle) = +1$ if and only if $j_1 \leq k \leq j_2$. It follows that
$\bar{h}_w(\bar{\mathbf{x}}_i) = +1$ if and only if there is a $k \in \{j_1, \ldots, j_2\}$ such that $a_k$ is
an instance in $\bar{\mathbf{x}}_i$, that is such that $n_k = i$. This condition holds if and only if
$i \in Y$, hence $\bar{h}_w$ classifies $S$ according to the given labeling. It follows that $S$ is shattered
by $\bar{\mathcal{W}}_3$, therefore $d_{r,3} \geq |S| = \lfloor\log(2r)\rfloor$. ■
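
The construction of Lemma 8 can be reproduced numerically for $r = 4$. The sketch below is an illustration under assumptions: indices of the points are 0-based, subsets are concatenated in order of size, and the explicit separating hyperplane (a cap around the middle of the arc) is one possible choice that the proof does not specify.

```python
import math
from itertools import combinations, product

r = 4
L = int(math.floor(math.log2(2 * r)))                     # L = 3 bags

# Concatenate all subsets of [L]; each subset occupies one contiguous block of n.
subsets = [s for size in range(L + 1) for s in combinations(range(1, L + 1), size)]
n = [i for s in subsets for i in s]
K = len(n)                                                 # K = L * 2^(L-1)

# K equidistant points on the unit circle in the plane z = 1.
a = [(math.cos(2 * math.pi * k / K), math.sin(2 * math.pi * k / K), 1.0) for k in range(K)]

# Bag i collects the points whose entry in n equals i.
bags = {i: [a[k] for k in range(K) if n[k] == i] for i in range(1, L + 1)}


def sign(v: float) -> int:
    return 1 if v >= 0 else -1


def bag_label(w, bag) -> int:
    """Boolean OR of the linear classifier sign(<w, x>) over the instances of the bag."""
    return 1 if any(sign(sum(wi * xi for wi, xi in zip(w, x))) == 1 for x in bag) else -1


def arc_separator(j1: int, j2: int):
    """A hyperplane that is positive exactly on the contiguous arc a[j1], ..., a[j2]."""
    mid = math.pi * (j1 + j2) / K                          # angle of the arc's midpoint
    half = math.pi * (j2 - j1) / K                         # half of the arc's angular width
    return (math.cos(mid), math.sin(mid), -math.cos(half + math.pi / K))


def realizes(labeling) -> bool:
    Y = tuple(i for i in range(1, L + 1) if labeling[i - 1] == 1)
    if not Y:
        w = (0.0, 0.0, -1.0)                               # labels every point negative
    else:
        start = sum(len(s) for s in subsets[:subsets.index(Y)])
        w = arc_separator(start, start + len(Y) - 1)
    return all(bag_label(w, bags[i]) == labeling[i - 1] for i in range(1, L + 1))


shattered = all(realizes(lab) for lab in product((-1, 1), repeat=L))
print("K =", K, "shattered:", shattered)
```

Each bag has exactly $r = 4$ instances here, and every one of the $2^L = 8$ labelings is realized by some hyperplane, in agreement with the lemma.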



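Lemma 9 Let $k, n, r$ be natural numbers such that $k \leq n$. Then
$d_{r,n} \geq \lfloor n/k \rfloor d_{r,k}$.

Proof For a vector $x \in \mathbb{R}^k$ and a number $t \in \{0, \ldots, \lfloor n/k \rfloor - 1\}$ define the
vector $s(x, t) \triangleq (0, \ldots, 0, x[1], \ldots, x[k], 0, \ldots, 0) \in \mathbb{R}^n$, where $x[1]$ is
at coordinate $kt + 1$. Similarly, for a bag
$\bar{\mathbf{x}}_i = (x_i[1], \ldots, x_i[r]) \in (\mathbb{R}^k)^r$, define the bag
$s(\bar{\mathbf{x}}_i, t) \triangleq (s(x_i[1], t), \ldots, s(x_i[r], t)) \in (\mathbb{R}^n)^r$.

Let $S_k = \{\bar{\mathbf{x}}_i\}_{i \in [d_{r,k}]} \subseteq (\mathbb{R}^k)^r$ be a set of bags with
instances in $\mathbb{R}^k$ that is shattered by $\bar{\mathcal{W}}_k$. Define $S_n$, a set of bags with
instances in $\mathbb{R}^n$:
$S_n \triangleq \{s(\bar{\mathbf{x}}_i, t)\}_{i \in [d_{r,k}],\, t \in \{0, \ldots, \lfloor n/k \rfloor - 1\}} \subseteq (\mathbb{R}^n)^r$.
Then $S_n$ is shattered by $\bar{\mathcal{W}}_n$: Let $\{y_{(i,t)}\}$, indexed by $i \in [d_{r,k}]$ and
$t \in \{0, \ldots, \lfloor n/k \rfloor - 1\}$, be some labeling for $S_n$. $S_k$ is shattered by
$\bar{\mathcal{W}}_k$, hence there are separators
$w_0, \ldots, w_{\lfloor n/k \rfloor - 1} \in \mathbb{R}^k$ such that for all $i \in [d_{r,k}]$ and all $t$,
$\bar{h}_{w_t}(\bar{\mathbf{x}}_i) = y_{(i,t)}$.
Set $w \triangleq \sum_{t=0}^{\lfloor n/k \rfloor - 1} s(w_t, t)$. Then
$\langle w, s(x, t) \rangle = \langle w_t, x \rangle$. Therefore

$$\bar{h}_w(s(\bar{\mathbf{x}}_i, t)) = \mathrm{OR}(\mathrm{sign}(\langle w, s(x_i[1], t) \rangle), \ldots, \mathrm{sign}(\langle w, s(x_i[r], t) \rangle))$$
$$= \mathrm{OR}(\mathrm{sign}(\langle w_t, x_i[1] \rangle), \ldots, \mathrm{sign}(\langle w_t, x_i[r] \rangle)) = \bar{h}_{w_t}(\bar{\mathbf{x}}_i) = y_{(i,t)}.$$

$S_n$ is thus shattered, hence $d_{r,n} \geq |S_n| = \lfloor n/k \rfloor d_{r,k}$. ■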

The desired theorem is an immediate consequence of the two lemmas above, by noting that whenever
$r \in R$, the VC-dimension of $\bar{\mathcal{W}}_n$ is at least $d_{r,n}$.

Theorem 10 Let $\mathcal{W}_n$ be the class of separating hyperplanes in $\mathbb{R}^n$ as defined above.
Assume that the bag function is $\psi = \mathrm{OR}$ and the set of allowed bag sizes is $R$. Let
$r = \max R$. Then the VC-dimension of $\bar{\mathcal{W}}_n$ is at least
$\lfloor n/3 \rfloor \lfloor \log 2r \rfloor$.

3.3 Pseudo dimension for thresholded functions 

In this section we consider binary hypothesis classes that are generated from real-valued functions using
thresholds. Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ be a set of real valued functions. The binary
hypothesis class of thresholded functions generated by $\mathcal{F}$ is
$T_{\mathcal{F}} = \{x \mapsto \mathrm{sign}(f(x) - z) \mid f \in \mathcal{F}, z \in \mathbb{R}\}$. The sample
complexity of learning with $T_{\mathcal{F}}$ and the zero-one loss is governed by the pseudo-dimension of
$\mathcal{F}$, which is equal to the VC-dimension of $T_{\mathcal{F}}$ (Pollard, 1984). In this section we
consider a bag-labeling function $\psi : \mathbb{R}^{(R)} \to \mathbb{R}$, and bound the pseudo-dimension of
$\bar{\mathcal{F}}$, thus providing an upper bound on the sample complexity of binary MIL with
$T_{\bar{\mathcal{F}}}$. The following bound holds for bag-labeling functions that extend monotone Boolean
functions, defined in Def. 2.

Theorem 11 Let J- C R*^ be a function class with pseudo-dimension d . Let R C [r], and assume that 
ij} : W-^^ — > R extends monotone Boolean functions. Let dr be the pseudo-dimension of J-. Then 

dr < max{16, 2(ilog(2e7-)}. 
Proof First, by Def. 2 we have that for any $\psi$ which extends monotone Boolean functions, any $n \in R$ and any $y \in \mathbb{R}^n$,

$$\mathrm{sign}(\psi(y[1], \dots, y[n]) - z) = \mathrm{sign}(\psi(y[1] - z, \dots, y[n] - z)) = \psi(\mathrm{sign}(y[1] - z), \dots, \mathrm{sign}(y[n] - z)). \tag{3}$$

This can be seen by noting that each of the equalities holds for each of the operations allowed by $\mathcal{M}_n$ for each $n$, thus by induction they hold for all functions in $\mathcal{M}_n$ and all combinations of them.

For a real-valued function $f$ and a scalar $z \in \mathbb{R}$, let $f_z$ be the function into $\{-1, +1\}$ defined by $f_z(y) = \mathrm{sign}(f(y) - z)$. Then $T_{\mathcal{F}} = \{f_z \mid f \in \mathcal{F},\ z \in \mathbb{R}\}$, and $T_{\overline{\mathcal{F}}} = \{(\bar{f})_z \mid f \in \mathcal{F},\ z \in \mathbb{R}\}$ (we use the parentheses in $(\bar{f})_z$ to emphasize that the bag operation is applied to $f$ and not to $f_z$). In addition, for all $f \in \mathcal{F}$, $z \in \mathbb{R}$, $n \in R$ and $\bar{x} \in \mathcal{X}^n$, we have

$$(\bar{f})_z(\bar{x}) = \mathrm{sign}(\bar{f}(\bar{x}) - z) = \mathrm{sign}(\psi(f(x[1]), \dots, f(x[n])) - z)$$

$$= \psi(\mathrm{sign}(f(x[1]) - z), \dots, \mathrm{sign}(f(x[n]) - z)) \tag{4}$$

$$= \psi(f_z(x[1]), \dots, f_z(x[n])) = \overline{f_z}(\bar{x}),$$

where the equality on line (4) follows from Eq. (3). Therefore

$$T_{\overline{\mathcal{F}}} = \{(\bar{f})_z \mid f \in \mathcal{F},\ z \in \mathbb{R}\} = \{\overline{f_z} \mid f \in \mathcal{F},\ z \in \mathbb{R}\} = \{\bar{h} \mid h \in T_{\mathcal{F}}\} = \overline{T_{\mathcal{F}}}.$$

The VC-dimension of $T_{\mathcal{F}}$ is equal to the pseudo-dimension of $\mathcal{F}$, which is $d$. Thus, by Theorem 6 the VC-dimension of $\overline{T_{\mathcal{F}}}$ is bounded by $\max\{16,\ 2d \log(2er)\}$. The proof is completed by noting that $d_r$, the pseudo-dimension of $\overline{\mathcal{F}}$, is exactly the VC-dimension of $T_{\overline{\mathcal{F}}}$. ■

This concludes our results for distribution-free sample complexity of binary MIL. In Section 6 we provide sample complexity analysis for distribution-dependent binary MIL, as a function of the average bag size.
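As a rough numerical illustration of Theorem 11, the following sketch evaluates the bound $\max\{16,\ 2d\log(2er)\}$ for a few bag sizes. The function name `mil_pseudo_dim_bound` is hypothetical, and the logarithm is assumed here to be base 2; with another base the constants change, but the growth in $r$ remains logarithmic.

```python
import math

def mil_pseudo_dim_bound(d, r):
    """Upper bound max{16, 2 d log(2 e r)} of Theorem 11 (base-2 log is an assumption)."""
    return max(16.0, 2.0 * d * math.log2(2.0 * math.e * r))

# Growing the maximal bag size r by a factor of 8 adds only about 2 * d * 3 to the bound.
for r in (1, 8, 64, 512):
    print(r, round(mil_pseudo_dim_bound(10, r), 1))
```

For small $d$ and $r$ the constant 16 dominates; for larger values the second term takes over and each doubling of $r$ adds $2d$ to the bound.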

4. Covering Numbers bounds for MIL 

Covering numbers are a useful measure of the complexity of a function class, since they allow bounding the sample complexity of a class in various settings, based on uniform convergence guarantees (see e.g. Anthony and Bartlett, 1999). In this section we provide a lemma that relates the covering numbers of bag hypothesis classes with those of the underlying instance hypothesis class. We will use this lemma in subsequent sections to derive sample complexity upper bounds for additional settings of MIL. Let $\mathcal{F} \subseteq \mathbb{R}^A$ be a set of real-valued functions over some domain $A$. A $\gamma$-cover of $\mathcal{F}$ with respect to a norm $\|\cdot\|_\circ$ defined on functions is a set of functions $C \subseteq \mathbb{R}^A$ such that for any $f \in \mathcal{F}$ there exists a $g \in C$ such that $\|f - g\|_\circ \le \gamma$. The covering number for given $\gamma > 0$, $\mathcal{F}$ and $\circ$, denoted by $\mathcal{N}(\gamma, \mathcal{F}, \circ)$, is the size of the smallest such $\gamma$-covering for $\mathcal{F}$.

Let $S \subseteq A$ be a finite set. We consider coverings with respect to the $L_p(S)$ norm for $p \ge 1$, defined by

$$\|f\|_{L_p(S)} = \left(\frac{1}{|S|} \sum_{s \in S} |f(s)|^p\right)^{1/p}.$$

For $p = \infty$, $L_\infty(S)$ is defined by $\|f\|_{L_\infty(S)} = \max_{s \in S} |f(s)|$. The covering number of $\mathcal{F}$ for a sample size $m$ with respect to the $L_p$ norm is

$$\mathcal{N}_m(\gamma, \mathcal{F}, p) = \sup_{S \subseteq A : |S| = m} \mathcal{N}(\gamma, \mathcal{F}, L_p(S)).$$
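To make these definitions concrete, here is a minimal sketch that computes the normalized $L_p(S)$ norm and builds a $\gamma$-cover greedily for a finite class given by its values on $S$. The helper names `lp_norm` and `greedy_cover` are hypothetical, and the greedy procedure is a swapped-in technique: it returns a valid cover whose size only upper-bounds the covering number, since it need not be the smallest one.

```python
import numpy as np

def lp_norm(values, p):
    """Normalized L_p(S) norm of a function given by its values on the finite set S."""
    values = np.abs(np.asarray(values, dtype=float))
    if np.isinf(p):
        return float(values.max())
    return float(np.mean(values ** p) ** (1.0 / p))

def greedy_cover(F, gamma, p):
    """Indices of rows of F that form a gamma-cover of all rows (row = one function on S)."""
    F = np.asarray(F, dtype=float)
    centers = []
    for i in range(len(F)):
        # Keep function i as a new center only if no existing center is within gamma of it.
        covered = any(lp_norm(F[i] - F[c], p) <= gamma for c in centers)
        if not covered:
            centers.append(i)
    return centers

F = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]
print(greedy_cover(F, gamma=0.2, p=np.inf))
```

Shrinking `gamma` can only increase the size of the returned cover, which mirrors the monotonicity of the covering number in $\gamma$.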
A small covering number for a function class implies faster uniform convergence rates, hence smaller sample complexity for learning. The following lemma bounds the covering number of bag hypothesis classes whenever the bag function is Lipschitz with respect to the infinity norm (see Def. 4). Recall that all extensions of monotone Boolean functions (Def. 2) and all p-norm bag-functions are 1-Lipschitz, thus the following lemma holds for them with $a = 1$.

Lemma 12 Let $R \subseteq \mathbb{N}$ and suppose the bag function $\psi : \mathbb{R}^{(R)} \to \mathbb{R}$ is $a$-Lipschitz with respect to the infinity norm, for some $a > 0$. Let $S \subseteq \mathcal{X}^{(R)}$ be a finite set of bags, and let $r$ be the average size of a bag in $S$. For any $\gamma > 0$, $p \in [1, \infty]$, and hypothesis class $\mathcal{H} \subseteq \mathbb{R}^{\mathcal{X}}$,

$$\mathcal{N}(\gamma, \overline{\mathcal{H}}, L_p(S)) \le \mathcal{N}\left(\frac{\gamma}{a r^{1/p}}, \mathcal{H}, L_p(S^{\cup})\right).$$

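Proof First, note that by the Lipschitz condition on $\psi$, for any bag $\bar{x}$ of size $n$ and hypotheses $h, g \in \mathcal{H}$,

$$|\bar{h}(\bar{x}) - \bar{g}(\bar{x})| = |\psi(h(x[1]), \dots, h(x[n])) - \psi(g(x[1]), \dots, g(x[n]))| \le a \max_{x \in \bar{x}} |h(x) - g(x)|. \tag{5}$$

Let $C$ be a minimal $\gamma$-cover of $\mathcal{H}$ with respect to the norm defined by $L_p(S^{\cup})$, so that $|C| = \mathcal{N}(\gamma, \mathcal{H}, L_p(S^{\cup}))$. For every $h \in \mathcal{H}$ there exists a $g \in C$ such that $\|h - g\|_{L_p(S^{\cup})} \le \gamma$. Assume $p < \infty$. Then by Eq. (5)

$$\|\bar{h} - \bar{g}\|_{L_p(S)} = \left(\frac{1}{|S|} \sum_{\bar{x} \in S} |\bar{h}(\bar{x}) - \bar{g}(\bar{x})|^p\right)^{1/p} \le \left(\frac{a^p}{|S|} \sum_{\bar{x} \in S} \max_{x \in \bar{x}} |h(x) - g(x)|^p\right)^{1/p}$$

$$\le a \left(\frac{1}{|S|} \sum_{x \in S^{\cup}} |h(x) - g(x)|^p\right)^{1/p} = a \left(\frac{|S^{\cup}|}{|S|}\right)^{1/p} \|h - g\|_{L_p(S^{\cup})}$$

$$= a r^{1/p} \|h - g\|_{L_p(S^{\cup})} \le a r^{1/p} \cdot \gamma.$$

It follows that $\overline{C}$ is an $(a r^{1/p} \gamma)$-covering for $\overline{\mathcal{H}}$. For $p = \infty$ we have

$$\|\bar{h} - \bar{g}\|_{L_\infty(S)} = \max_{\bar{x} \in S} |\bar{h}(\bar{x}) - \bar{g}(\bar{x})| \le a \max_{\bar{x} \in S} \max_{x \in \bar{x}} |h(x) - g(x)|$$

$$= a \max_{x \in S^{\cup}} |h(x) - g(x)| = a \|h - g\|_{L_\infty(S^{\cup})} \le a\gamma = a \cdot r^{1/p} \cdot \gamma.$$

Thus in both cases, $\overline{C}$ is an $a r^{1/p} \gamma$-covering for $\overline{\mathcal{H}}$, and its size is $\mathcal{N}(\gamma, \mathcal{H}, L_p(S^{\cup}))$. Thus

$$\mathcal{N}(a r^{1/p} \gamma, \overline{\mathcal{H}}, L_p(S)) \le \mathcal{N}(\gamma, \mathcal{H}, L_p(S^{\cup})).$$

We get the statement of the lemma by substituting $\gamma$ with $\gamma / (a r^{1/p})$. ■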
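The key step of the proof, $\|\bar{h} - \bar{g}\|_{L_p(S)} \le a r^{1/p} \|h - g\|_{L_p(S^{\cup})}$, can be checked numerically. The sketch below does so for the bag function $\psi = \max$ (so $a = 1$) and two linear instance hypotheses on random bags; the function name `lemma12_sides`, the random data and the linear hypotheses are illustrative assumptions, not part of the analysis above.

```python
import numpy as np

def lemma12_sides(bags, h, g, p):
    """Return (lhs, rhs) of ||h_bar - g_bar||_{L_p(S)} <= r^{1/p} ||h - g||_{L_p(S_union)}
    for the bag function psi = max, which is 1-Lipschitz (a = 1)."""
    instances = np.concatenate(bags)           # all instances of all bags, with multiplicity
    r = len(instances) / len(bags)             # average bag size in the sample
    bag_diff = np.array([abs(h(b).max() - g(b).max()) for b in bags])
    inst_diff = np.abs(h(instances) - g(instances))
    lhs = np.mean(bag_diff ** p) ** (1.0 / p)
    rhs = r ** (1.0 / p) * np.mean(inst_diff ** p) ** (1.0 / p)
    return lhs, rhs

rng = np.random.default_rng(0)
bags = [rng.normal(size=(n, 3)) for n in (1, 4, 2, 7)]   # four bags of 3-dimensional instances
w_h, w_g = rng.normal(size=3), rng.normal(size=3)
h = lambda X: X @ w_h                                    # instance hypothesis h
g = lambda X: X @ w_g                                    # instance hypothesis g
for p in (1, 2, 5):
    print(p, lemma12_sides(bags, h, g, p))
```

The inequality holds for every choice of bags and hypotheses, so the printed left-hand side should never exceed the right-hand side.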

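As an immediate corollary, we have the following bound for covering numbers of a given sample size.

Corollary 13 Let $r \in \mathbb{N}$, and let $R \subseteq [r]$. Suppose the bag function $\psi : \mathbb{R}^{(R)} \to \mathbb{R}$ is $a$-Lipschitz with respect to the infinity norm for some $a > 0$. Let $\gamma > 0$, $p \in [1, \infty]$, and $\mathcal{H} \subseteq \mathbb{R}^{\mathcal{X}}$. For any $m \ge 0$,

$$\mathcal{N}_m(\gamma, \overline{\mathcal{H}}, p) \le \mathcal{N}_{rm}\left(\frac{\gamma}{a \cdot r^{1/p}}, \mathcal{H}, p\right).$$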
5. Margin Learning for MIL

Large-margin classification is a popular supervised learning approach, which has received attention also as a method for MIL. For instance, MI-SVM (Andrews et al., 2002) attempts to optimize an adaptation of the soft-margin SVM objective (Cortes and Vapnik, 1995) to MIL, in which the margin of a bag is the maximal margin achieved by any of its instances. It has not been shown, however, whether minimizing the objective function of MI-SVM, or other margin formulations for MIL, allows learning with a reasonable sample size. We fill in this gap in Theorem 14 below, which bounds the $\gamma$-fat-shattering dimension (see e.g. Anthony and Bartlett, 1999) of MIL. The objective of MI-SVM amounts to replacing the hypothesis class $\mathcal{H}$ of separating hyperplanes with the class of bag-hypotheses $\overline{\mathcal{H}}$ where the bag function is $\psi = \max$. Since max is the real-valued extension of OR, this objective function is natural in our MIL formulation. The distribution-free sample complexity of large-margin learning with the zero-one loss is proportional to the fat-shattering dimension (Alon et al., 1997). Thus, we provide an upper bound on the fat-shattering dimension of MIL as a function of the fat-shattering dimension of the underlying hypothesis class, and of the maximal allowed bag size. The bound holds for any Lipschitz bag-function. Let $\gamma > 0$ be the desired margin. For a hypothesis class $\mathcal{H}$, denote its $\gamma$-fat-shattering dimension by $\mathrm{Fat}(\gamma, \mathcal{H})$.

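Theorem 14 Let $r \in \mathbb{N}$ and assume $R \subseteq [r]$. Let $B, a > 0$. Let $\mathcal{H} \subseteq [0, B]^{\mathcal{X}}$ be a real-valued hypothesis class and assume that the bag function $\psi : [0, B]^{(R)} \to [0, aB]$ is $a$-Lipschitz with respect to the infinity norm. Then for all $\gamma \in (0, aB]$

$$\mathrm{Fat}(\gamma, \overline{\mathcal{H}}) \le \max\left\{33,\ 24\, \mathrm{Fat}\left(\frac{\gamma}{64a}, \mathcal{H}\right) \log^2\left(\frac{6 \cdot 2048 \cdot B^2 a^2}{\gamma^2} \cdot \mathrm{Fat}\left(\frac{\gamma}{64a}, \mathcal{H}\right) \cdot r\right)\right\}. \tag{6}$$

This theorem shows that for margin learning as well, the dependence of the sample complexity on the bag size is poly-logarithmic. In the proof of the theorem we use the following two results, which link the covering number of a function class with its fat-shattering dimension.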



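Theorem 15 (Anthony and Bartlett, 1999, Theorem 12.10) Let $\mathcal{F}$ be a set of real-valued functions and let $\gamma > 0$. For $m \ge \mathrm{Fat}(16\gamma, \mathcal{F})$,

$$e^{\mathrm{Fat}(16\gamma, \mathcal{F})/8} \le \mathcal{N}_m(\gamma, \mathcal{F}, \infty).$$

Theorem 16 (Anthony and Bartlett, 1999, Theorem 12.8) Let $\mathcal{F}$ be a set of real-valued functions with range in $[0, B]$. Let $\gamma > 0$. For all $m \ge 1$,

$$\mathcal{N}_m(\gamma, \mathcal{F}, \infty) < 2 \left(\frac{4B^2 m}{\gamma^2}\right)^{\mathrm{Fat}(\gamma/4, \mathcal{F}) \log(4eBm/\gamma)}. \tag{7}$$

Theorem 12.8 in Anthony and Bartlett (1999) deals with the case $m \ge \mathrm{Fat}(\gamma/4, \mathcal{F})$. Here we only require $m \ge 1$, since if $m \le \mathrm{Fat}(\gamma/4, \mathcal{F})$ then the trivial upper bound $\mathcal{N}_m(\gamma, \mathcal{F}, \infty) \le (B/\gamma)^m \le (B/\gamma)^{\mathrm{Fat}(\gamma/4, \mathcal{F})}$ implies Eq. (7).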
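Proof [of Theorem 14] From Theorem 15 and Lemma 12 it follows that for $m \ge \mathrm{Fat}(16\gamma, \overline{\mathcal{H}})$,

$$\mathrm{Fat}(16\gamma, \overline{\mathcal{H}}) \le \frac{8}{\log e} \log \mathcal{N}_m(\gamma, \overline{\mathcal{H}}, \infty) \le 6 \log \mathcal{N}_{rm}(\gamma/a, \mathcal{H}, \infty). \tag{8}$$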
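By Theorem 16, for all $m \ge 1$, if $\mathrm{Fat}(\gamma/4, \mathcal{H}) \ge 1$ then

$$\forall \gamma \le \frac{B}{2e}, \quad \log \mathcal{N}_m(\gamma, \mathcal{H}, \infty) \le 1 + \mathrm{Fat}\left(\frac{\gamma}{4}, \mathcal{H}\right) \log\left(\frac{4eBm}{\gamma}\right) \log\left(\frac{4B^2 m}{\gamma^2}\right)$$

$$\le \mathrm{Fat}\left(\frac{\gamma}{4}, \mathcal{H}\right) \log\left(\frac{8eBm}{\gamma}\right) \log\left(\frac{4B^2 m}{\gamma^2}\right) \tag{9}$$

$$\le \mathrm{Fat}\left(\frac{\gamma}{4}, \mathcal{H}\right) \log^2\left(\frac{4B^2 m}{\gamma^2}\right). \tag{10}$$

The inequality in line (9) holds since we have added 1 to the second factor, and the value of the other factors is at least 1. The last inequality follows since if $\gamma \le \frac{B}{2e}$, we have $8eB/\gamma \le 4B^2/\gamma^2$. Eq. (10) also holds if $\mathrm{Fat}(\gamma/4, \mathcal{H}) < 1$, since this implies $\mathrm{Fat}(\gamma/4, \mathcal{H}) = 0$ and $\mathcal{N}_m(\gamma, \mathcal{H}, \infty) = 1$. Combining Eq. (8) and Eq. (10), we get that if $m \ge \mathrm{Fat}(16\gamma, \overline{\mathcal{H}})$ then

$$\forall \gamma \le \frac{aB}{2e}, \quad \mathrm{Fat}(16\gamma, \overline{\mathcal{H}}) \le 6\, \mathrm{Fat}\left(\frac{\gamma}{4a}, \mathcal{H}\right) \log^2\left(\frac{4B^2 a^2 r m}{\gamma^2}\right). \tag{11}$$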



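Set $m = \lceil \mathrm{Fat}(16\gamma, \overline{\mathcal{H}}) \rceil \le \mathrm{Fat}(16\gamma, \overline{\mathcal{H}}) + 1$. If $\mathrm{Fat}(16\gamma, \overline{\mathcal{H}}) \ge 1$, we have that $m \ge \mathrm{Fat}(16\gamma, \overline{\mathcal{H}})$ and also $m \le 2\,\mathrm{Fat}(16\gamma, \overline{\mathcal{H}})$. Thus Eq. (11) holds, and

$$\forall \gamma \le \frac{aB}{2e}, \quad \mathrm{Fat}(16\gamma, \overline{\mathcal{H}}) \le 6\, \mathrm{Fat}\left(\frac{\gamma}{4a}, \mathcal{H}\right) \log^2\left(\frac{4B^2 a^2}{\gamma^2} \cdot r \cdot (\mathrm{Fat}(16\gamma, \overline{\mathcal{H}}) + 1)\right)$$

$$\le 6\, \mathrm{Fat}\left(\frac{\gamma}{4a}, \mathcal{H}\right) \log^2\left(\frac{8B^2 a^2}{\gamma^2} \cdot r \cdot \mathrm{Fat}(16\gamma, \overline{\mathcal{H}})\right).$$

Now, it is easy to see that if $\mathrm{Fat}(16\gamma, \overline{\mathcal{H}}) < 1$, this inequality also holds. Therefore it holds in general. Substituting $\gamma$ with $\gamma/16$, we have that

$$\forall \gamma \le \frac{8aB}{e}, \quad \mathrm{Fat}(\gamma, \overline{\mathcal{H}}) \le 6\, \mathrm{Fat}\left(\frac{\gamma}{64a}, \mathcal{H}\right) \log^2\left(\frac{2048 B^2 a^2}{\gamma^2} \cdot r \cdot \mathrm{Fat}(\gamma, \overline{\mathcal{H}})\right). \tag{12}$$

Note that the condition on $\gamma$ holds, in particular, for all $\gamma \le aB$.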

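To derive the desired Eq. (6) from Eq. (12), let $\beta = 6\,\mathrm{Fat}(\gamma/64a, \mathcal{H})$ and $\eta = 2048 B^2 a^2/\gamma^2$. Denote $F = \mathrm{Fat}(\gamma, \overline{\mathcal{H}})$. Then Eq. (12) can be restated as $F \le \beta \log^2(\eta r F)$. It follows that $\sqrt{F}/\log(\eta r F) \le \sqrt{\beta}$. Thus

$$\frac{\sqrt{F}}{\log(\eta r F)} \log\left(\frac{\sqrt{\eta r F}}{\log(\eta r F)}\right) \le \sqrt{\beta} \log\left(\sqrt{\beta \eta r}\right).$$

Therefore

$$\frac{\sqrt{F}}{\log(\eta r F)} \big(\log(\eta r F)/2 - \log(\log(\eta r F))\big) \le \sqrt{\beta} \log(\beta \eta r)/2,$$

hence

$$\sqrt{F} \left(1 - \frac{2 \log(\log(\eta r F))}{\log(\eta r F)}\right) \le \sqrt{\beta} \log(\beta \eta r).$$

Now, it is easy to verify that $\log(\log(x))/\log(x) \le \frac{1}{4}$ for all $x \ge 33 \cdot 2048$. Assume $F \ge 33$ and $\gamma \le aB$. Then

$$\eta r F = 2048 B^2 a^2 r F/\gamma^2 \ge 2048 F \ge 33 \cdot 2048.$$

Therefore $\log(\log(\eta r F))/\log(\eta r F) \le \frac{1}{4}$, which implies $\frac{1}{2}\sqrt{F} \le \sqrt{\beta} \log(\beta \eta r)$. Thus $F \le 4\beta \log^2(\beta \eta r)$. Substituting the parameters with their values, we get the desired bound, stated in Eq. (6). ■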
6. Sample Complexity by Average Bag Size

The upper bounds we have shown so far provide distribution-free sample complexity bounds, which depend only on the maximal possible bag size. In this section we show that even if the bag size is unbounded, we can still have a sample complexity guarantee, if the average bag size for the input distribution is bounded. For this analysis we use the notion of Rademacher complexity (Bartlett and Mendelson, 2002). Let $A$ be some domain. The empirical Rademacher complexity of a class of functions $\mathcal{F} \subseteq \mathbb{R}^{A \times \{-1, +1\}}$ with respect to a sample $S = \{(x_i, y_i)\}_{i \in [m]} \subseteq A \times \{-1, +1\}$ is

$$\mathcal{R}(\mathcal{F}, S) \triangleq \frac{1}{m} \mathbb{E}_\sigma\left[\,\left|\sup_{f \in \mathcal{F}} \sum_{i \in [m]} \sigma_i f(x_i, y_i)\right|\,\right],$$

where $\sigma = (\sigma_1, \dots, \sigma_m)$ are $m$ independent uniform $\{\pm 1\}$-valued variables. The average Rademacher complexity of $\mathcal{F}$ with respect to a distribution $D$ over $A \times \{-1, +1\}$ and a sample size $m$ is

$$\mathcal{R}_m(\mathcal{F}, D) \triangleq \mathbb{E}_{S \sim D^m}[\mathcal{R}(\mathcal{F}, S)].$$

The worst-case Rademacher complexity over samples of size $m$ is

$$\mathcal{R}_m^{\sup}(\mathcal{F}) = \sup_{S \subseteq A^m} \mathcal{R}(\mathcal{F}, S).$$

This quantity can be tied to the fat-shattering dimension via the following result:

Theorem 17 (See e.g. Mendelson, 2002, Theorem 4.11) Let $m \ge 1$ and $\gamma \ge 0$. If $\mathcal{R}_m^{\sup}(\mathcal{F}) \le \gamma$ then the $\gamma$-fat-shattering dimension of $\mathcal{F}$ is at most $m$.

Let $I \subseteq \mathbb{R}$. Assume a hypothesis class $\mathcal{H} \subseteq I^A$ and a loss function $\ell : \{-1, +1\} \times I \to \mathbb{R}$. For a hypothesis $h \in \mathcal{H}$, we denote by $h_\ell$ the function defined by $h_\ell(x, y) = \ell(y, h(x))$. Given $\mathcal{H}$ and $\ell$, we define the function class $\mathcal{H}_\ell = \{h_\ell \mid h \in \mathcal{H}\} \subseteq \mathbb{R}^{A \times \{-1, +1\}}$.
Rademacher complexities can be used to derive sample complexity bounds (Bartlett and Mendelson, 2002): Assume the range of the loss function is $[0, 1]$. For any $\delta \in (0, 1)$, with probability of $1 - \delta$ over the draw of samples $S \subseteq A \times \{-1, +1\}$ of size $m$ drawn from $D$, every $h \in \mathcal{H}$ satisfies



£{h, D) < e{h, S) + 2nm{Hi, D) + J^M^. (13) 

V m 

Thus, an upper bound on the Rademacher complexity implies an upper bound on the average loss of a 
classifier learned from a random sample. 
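For a finite function class, the empirical Rademacher complexity defined above can be estimated by sampling the sign vector $\sigma$. The following is a minimal Monte-Carlo sketch; the function name `empirical_rademacher`, the number of draws and the representation of the class as a matrix of values $f(x_i, y_i)$ are assumptions made for illustration.

```python
import itertools
import numpy as np

def empirical_rademacher(values, num_draws=2000, seed=0):
    """Monte-Carlo estimate of (1/m) E_sigma | sup_f sum_i sigma_i f(x_i, y_i) |.

    values: array of shape (|F|, m); row j holds the values of the j-th function on the sample.
    """
    values = np.asarray(values, dtype=float)
    m = values.shape[1]
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, m))   # independent uniform signs
    sups = (sigma @ values.T).max(axis=1)                  # sup over the (finite) class per draw
    return float(np.mean(np.abs(sups)) / m)

# A class realizing every sign pattern on 3 points can always match sigma exactly.
all_signs = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
print(empirical_rademacher(all_signs))
```

A class that can fit every sign pattern attains the maximal value 1, while a single fixed function has complexity that shrinks as $m$ grows; this is the sense in which the quantity measures the richness of the class on the sample.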

6.1 Binary MIL 

Our first result complements the distribution-free sample complexity bounds that were provided for binary MIL in Section 3. The average (or expected) bag size under a distribution $D$ over $\mathcal{X}^{(R)} \times \{-1, +1\}$ is $\mathbb{E}_{(\bar{X}, Y) \sim D}[|\bar{X}|]$. Our sample complexity bound for binary MIL depends on the average bag size and the VC dimension of the instance hypothesis class. Recall that the zero-one loss is defined by $\ell_{0/1}(y, \hat{y}) = \mathbb{1}[y \ne \hat{y}]$. For a sample of labeled examples $S = \{(x_i, y_i)\}_{i \in [m]}$, we use $S_X$ to denote the examples of $S$, that is $S_X = \{x_i\}_{i \in [m]}$.

Theorem 18 Let $\mathcal{H} \subseteq \{-1, +1\}^{\mathcal{X}}$ be a binary hypothesis class with VC-dimension $d$. Let $R \subseteq \mathbb{N}$ and assume a bag function $\psi : \{-1, +1\}^{(R)} \to \{-1, +1\}$. Let $r$ be the average bag size under distribution $D$ over labeled bags. Then






Multi-Instance Learning with Any Hypothesis Class 



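Proof Let $S$ be a labeled bag-sample of size $m$. Dudley's entropy integral (Dudley, 1967) states that

$$\mathcal{R}(\overline{\mathcal{H}}_{\ell_{0/1}}, S) \le \frac{12}{\sqrt{m}} \int_0^\infty \sqrt{\ln \mathcal{N}(\gamma, \overline{\mathcal{H}}_{\ell_{0/1}}, L_2(S))}\, d\gamma. \tag{14}$$

If $C$ is a $\gamma$-cover for $\overline{\mathcal{H}}$ with respect to the norm $L_2(S_X)$, then $C_{\ell_{0/1}}$ is a $\gamma/2$-cover for $\overline{\mathcal{H}}_{\ell_{0/1}}$ with respect to the norm $L_2(S)$. This can be seen as follows: Let $h_{\ell_{0/1}} \in \overline{\mathcal{H}}_{\ell_{0/1}}$ for some $h \in \overline{\mathcal{H}}$. Let $f \in C$ such that $\|f - h\|_{L_2(S_X)} \le \gamma$. We have

$$\|f_{\ell_{0/1}} - h_{\ell_{0/1}}\|_{L_2(S)} = \left(\frac{1}{m} \sum_{(x, y) \in S} |\ell_{0/1}(y, f(x)) - \ell_{0/1}(y, h(x))|^2\right)^{1/2}$$

$$= \left(\frac{1}{m} \sum_{x \in S_X} \left(\frac{1}{2}|f(x) - h(x)|\right)^2\right)^{1/2} = \frac{1}{2}\|f - h\|_{L_2(S_X)} \le \gamma/2.$$

Therefore $C_{\ell_{0/1}}$ is a $\gamma/2$-cover with respect to $L_2(S)$. It follows that we can bound the $\gamma$-covering number of $\overline{\mathcal{H}}_{\ell_{0/1}}$ by:

$$\mathcal{N}(\gamma, \overline{\mathcal{H}}_{\ell_{0/1}}, L_2(S)) \le \mathcal{N}(2\gamma, \overline{\mathcal{H}}, L_2(S_X)). \tag{15}$$

Let $r(S)$ be the average bag size in the sample $S$, that is $r(S) = |S_X^{\cup}|/|S|$. By Lemma 12,

$$\mathcal{N}(\gamma, \overline{\mathcal{H}}, L_2(S_X)) \le \mathcal{N}(\gamma/\sqrt{r(S)}, \mathcal{H}, L_2(S_X^{\cup})). \tag{16}$$

From Eq. (14), Eq. (15) and Eq. (16) we conclude that

$$\mathcal{R}(\overline{\mathcal{H}}_{\ell_{0/1}}, S) \le \frac{12}{\sqrt{m}} \int_0^\infty \sqrt{\ln \mathcal{N}(2\gamma/\sqrt{r(S)}, \mathcal{H}, L_2(S_X^{\cup}))}\, d\gamma.$$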



Bv iDudlevi (1 1 978h . for any H with VC-dimension d, and any 7 > 0, 

\nAf{^,H,L2{S'i)) <2d\n(^ 
Therefore 



7^(H.„,, , 5) < i| ^' y^l^i!^^^^ d^ 



^ / d(ln(er(5)) + V^) ^ ^^ /dln(4er(5)) 



The function ^ln(x) is concave for x > 1. Therefore we may take the expectation of both sides of this 
inequality and apply Jensen's inequality, to get 



TZm{nig,,,D) = Es^D-[7^(Hvi'^)] ^ lEs^S" 
dln(4er(S'))



^ ^ dHAe-Es^pr^jriS)]) ^ ^ /dln(4er) 



m 
We conclude that even when the bag size is not bounded, the sample complexity of binary MIL with a specific distribution depends only logarithmically on the average bag size in this distribution, and linearly on the VC-dimension of the underlying instance hypothesis class.

6.2 Real-Valued Hypothesis Classes

In our second result we wish to bound the sample complexity of MIL when using other loss functions that 
accept real valued predictions. This bound will depend on the average bag size, and on the Rademacher 
complexity of the instance hypothesis class. 

We consider the case where both the bag function and the loss function are Lipschitz. For the bag function, recall that all extensions of monotone Boolean functions are Lipschitz with respect to the infinity norm. For the loss function $\ell : \{-1, +1\} \times \mathbb{R} \to \mathbb{R}$, we require that it is Lipschitz in its second argument, i.e. that there is a constant $a > 0$ such that for all $y \in \{-1, +1\}$ and $y_1, y_2 \in \mathbb{R}$, $|\ell(y, y_1) - \ell(y, y_2)| \le a|y_1 - y_2|$. This property is satisfied by many popular losses. For instance, consider the hinge-loss, which is the loss minimized by soft-margin SVM. It is defined as $\ell_{hl}(y, \hat{y}) = [1 - y\hat{y}]_+$, and is 1-Lipschitz in its second argument.
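The 1-Lipschitz property of the hinge-loss in its second argument is easy to check numerically. The sketch below evaluates $\ell_{hl}(y, \hat{y}) = [1 - y\hat{y}]_+$ on random predictions; the function name `hinge_loss` and the random test data are illustrative assumptions.

```python
import numpy as np

def hinge_loss(y, y_hat):
    """Hinge loss [1 - y * y_hat]_+ for labels y in {-1, +1} and real-valued predictions."""
    return np.maximum(0.0, 1.0 - y * y_hat)

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=1000)
a = rng.normal(scale=3.0, size=1000)      # first set of predictions
b = rng.normal(scale=3.0, size=1000)      # second set of predictions
gap = np.abs(hinge_loss(y, a) - hinge_loss(y, b))
print(bool(np.all(gap <= np.abs(a - b) + 1e-12)))
```

Note that the hinge-loss is unbounded, so applying results that assume a loss with range $[0, 1]$ requires bounded predictions and a suitable rescaling.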

The following lemma provides a bound on the empirical Rademacher complexity of MIL, as a function of the average bag size in the sample and of the behavior of the worst-case Rademacher complexity over instances. We will subsequently use this bound to bound the average Rademacher complexity of MIL with respect to a distribution. We consider losses with the range $[0, 1]$. To avoid degenerate cases, we consider only losses such that there exists at least one labeled bag $(\bar{x}, y) \in \mathcal{X}^{(R)} \times \{-1, +1\}$ and hypotheses $h, g \in \mathcal{H}$ such that $\bar{h}_\ell(\bar{x}, y) = 0$ and $\bar{g}_\ell(\bar{x}, y) = 1$. We say that such a loss has a full range.

Lemma 19 Let $\mathcal{H} \subseteq [0, B]^{\mathcal{X}}$ be a hypothesis class. Let $R \subseteq \mathbb{N}$, and let the bag function $\psi : \mathbb{R}^{(R)} \to \mathbb{R}$ be $a_1$-Lipschitz with respect to the infinity norm. Assume a loss function $\ell : \{-1, +1\} \times \mathbb{R} \to [0, 1]$, which is $a_2$-Lipschitz in its second argument. Further assume that $\ell$ has a full range. Suppose there is a continuous decreasing function $f : (0, 1] \to \mathbb{R}$ such that

$$\forall \gamma \in (0, 1], \quad f(\gamma) \in \mathbb{N} \implies \mathcal{R}^{\sup}_{f(\gamma)}(\mathcal{H}) \le \gamma.$$

Let $S$ be a labeled bag-sample of size $m$, with an average bag size $r$. Then for all $\epsilon \in (0, 1]$,

$$\mathcal{R}(\overline{\mathcal{H}}_\ell, S) \le 4\epsilon + \frac{10}{\sqrt{m}} \log\left(\frac{4e a_1^2 a_2^2 B^2 r m}{\epsilon^2}\right) \left(1 + \int_\epsilon^1 \sqrt{f\left(\frac{\gamma}{4 a_1 a_2}\right)}\, d\gamma\right).$$



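Proof A refinement of Dudley's entropy integral (N. et al., 2010, Lemma A.3) states that for all $\epsilon \in (0, 1]$, for all real function classes $\mathcal{F}$ with range $[0, 1]$ and for all sets $S$,

$$\mathcal{R}(\mathcal{F}, S) \le 4\epsilon + \frac{10}{\sqrt{m}} \int_\epsilon^1 \sqrt{\ln \mathcal{N}(\gamma, \mathcal{F}, L_2(S))}\, d\gamma. \tag{17}$$

Since the range of $\ell$ is $[0, 1]$, this holds for $\mathcal{F} = \overline{\mathcal{H}}_\ell$. In addition, for any set $S$, the $L_2(S)$ norm is bounded from above by the $L_\infty(S)$ norm. Therefore $\mathcal{N}(\gamma, \mathcal{F}, L_2(S)) \le \mathcal{N}(\gamma, \mathcal{F}, L_\infty(S))$. Thus, by Eq. (17) we have

$$\mathcal{R}(\overline{\mathcal{H}}_\ell, S) \le 4\epsilon + \frac{10}{\sqrt{m}} \int_\epsilon^1 \sqrt{\ln \mathcal{N}(\gamma, \overline{\mathcal{H}}_\ell, L_\infty(S))}\, d\gamma. \tag{18}$$

Now, let $\bar{h}, \bar{g} \in \overline{\mathcal{H}}$ and consider $\bar{h}_\ell, \bar{g}_\ell \in \overline{\mathcal{H}}_\ell$. Since $\ell$ is $a_2$-Lipschitz, we have

$$\|\bar{h}_\ell - \bar{g}_\ell\|_{L_\infty(S)} = \max_{i \in [m]} |\bar{h}_\ell(\bar{x}_i, y_i) - \bar{g}_\ell(\bar{x}_i, y_i)| = \max_{i \in [m]} |\ell(y_i, \bar{h}(\bar{x}_i)) - \ell(y_i, \bar{g}(\bar{x}_i))|$$

$$\le a_2 \max_{i \in [m]} |\bar{h}(\bar{x}_i) - \bar{g}(\bar{x}_i)| = a_2 \|\bar{h} - \bar{g}\|_{L_\infty(S_X)}.$$

It follows that if $C \subseteq \overline{\mathcal{H}}$ is a $\gamma/a_2$-cover for $\overline{\mathcal{H}}$ then $C_\ell \subseteq \overline{\mathcal{H}}_\ell$ is a $\gamma$-cover for $\overline{\mathcal{H}}_\ell$. Therefore $\mathcal{N}(\gamma, \overline{\mathcal{H}}_\ell, L_\infty(S)) \le \mathcal{N}(\gamma/a_2, \overline{\mathcal{H}}, L_\infty(S_X))$. By Lemma 12,

$$\mathcal{N}(\gamma/a_2, \overline{\mathcal{H}}, L_\infty(S_X)) \le \mathcal{N}(\gamma/a_1 a_2, \mathcal{H}, L_\infty(S_X^{\cup})) \le \mathcal{N}_{rm}(\gamma/a_1 a_2, \mathcal{H}, \infty).$$

Combining this with Eq. (18) it follows that

$$\mathcal{R}(\overline{\mathcal{H}}_\ell, S) \le 4\epsilon + \frac{10}{\sqrt{m}} \int_\epsilon^1 \sqrt{\ln \mathcal{N}_{rm}(\gamma/a_1 a_2, \mathcal{H}, \infty)}\, d\gamma. \tag{19}$$

Now, let $\gamma \in (0, 1]$, and let $\gamma_0 = \sup\{\gamma' \le \gamma \mid f(\gamma') \in \mathbb{N}\}$. Since $\mathcal{R}^{\sup}_{f(\gamma_0)}(\mathcal{H}) \le \gamma_0$, by Theorem 17 the $\gamma_0$-fat-shattering dimension of $\mathcal{H}$ is at most $f(\gamma_0)$. It follows that

$$\mathrm{Fat}(\gamma, \mathcal{H}) \le \mathrm{Fat}(\gamma_0, \mathcal{H}) \le f(\gamma_0) \le 1 + f(\gamma).$$

The last inequality follows from the definition of $\gamma_0$, since $f$ is continuous and decreasing. Therefore, by Theorem 16,

$$\forall \gamma \le B, \quad \log \mathcal{N}_m(\gamma, \mathcal{H}, \infty) \le 1 + \left(f\left(\frac{\gamma}{4}\right) + 1\right) \log\left(\frac{4eBm}{\gamma}\right) \log\left(\frac{4B^2 m}{\gamma^2}\right)$$

$$\le \left(f\left(\frac{\gamma}{4}\right) + 1\right) \log\left(\frac{4eBm}{\gamma}\right) \log\left(\frac{4eB^2 m}{\gamma^2}\right) \tag{20}$$

$$\le \left(f\left(\frac{\gamma}{4}\right) + 1\right) \log^2\left(\frac{4eB^2 m}{\gamma^2}\right). \tag{21}$$

The inequality in line (20) holds since we have added $\log(e) \ge 1$ to the third factor, and the value of the other factors is at least 1. The last inequality follows since $\gamma \le B$.

We now show that the assumption $\gamma \le B$ does not restrict us: By the assumptions on $\ell$, there are $h, g \in \mathcal{H}$ and a labeled bag $(\bar{x}, y)$ such that $\bar{h}_\ell(\bar{x}, y) = 1$ and $\bar{g}_\ell(\bar{x}, y) = 0$. Let $n = |\bar{x}|$. By the Lipschitz assumptions we have

$$1 = |\bar{h}_\ell(\bar{x}, y) - \bar{g}_\ell(\bar{x}, y)| = |\ell(y, \bar{h}(\bar{x})) - \ell(y, \bar{g}(\bar{x}))| \le a_2 |\bar{h}(\bar{x}) - \bar{g}(\bar{x})|$$

$$= a_2 |\psi(h(x[1]), \dots, h(x[n])) - \psi(g(x[1]), \dots, g(x[n]))| \le a_2 a_1 \max_{j \in [n]} |h(x[j]) - g(x[j])| \le a_1 a_2 B.$$

Thus $1 \le a_1 a_2 B$. It follows that for all $\gamma \in (0, 1]$, $\gamma/a_1 a_2 \le B$. Thus Eq. (21) can be combined with Eq. (19) to get that

$$\mathcal{R}(\overline{\mathcal{H}}_\ell, S) \le 4\epsilon + \frac{10}{\sqrt{m}} \int_\epsilon^1 \sqrt{\left(f\left(\frac{\gamma}{4 a_1 a_2}\right) + 1\right) \log^2\left(\frac{4e a_1^2 a_2^2 B^2 r m}{\gamma^2}\right)}\, d\gamma$$

$$\le 4\epsilon + \frac{10}{\sqrt{m}} \log\left(\frac{4e a_1^2 a_2^2 B^2 r m}{\epsilon^2}\right) \int_\epsilon^1 \sqrt{f\left(\frac{\gamma}{4 a_1 a_2}\right) + 1}\, d\gamma$$

$$\le 4\epsilon + \frac{10}{\sqrt{m}} \log\left(\frac{4e a_1^2 a_2^2 B^2 r m}{\epsilon^2}\right) \left(1 + \int_\epsilon^1 \sqrt{f\left(\frac{\gamma}{4 a_1 a_2}\right)}\, d\gamma\right).$$

The last inequality follows from the fact that $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for non-negative $a$ and $b$, and from $\int_\epsilon^1 1\, d\gamma \le 1$.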
Based on Lemma 19 we will now bound the average Rademacher complexity of MIL, as a function of the worst-case Rademacher complexity over instances, and the expected bag size. Since the number of instances in a bag sample of a certain size is not fixed, but depends on the bag sizes in the specific sample, we will need to consider the behavior of $\mathcal{R}_m^{\sup}(\mathcal{H})$ for different values of $m$. For many learnable function classes, the Rademacher complexity is proportional to $\frac{1}{\sqrt{m}}$, or to $\frac{\ln^\beta(m)}{\sqrt{m}}$ for some non-negative $\beta$. The following theorem bounds the average Rademacher complexity of MIL in all these cases. The resulting bound indicates that here too there is a poly-logarithmic dependence of the sample complexity on the average bag size. Following the proof we show an application of the bound to a specific function class.

Theorem 20 Let $\mathcal{H} \subseteq [0, B]^{\mathcal{X}}$ be a hypothesis class. Let $R \subseteq \mathbb{N}$, and let the bag function $\psi : \mathbb{R}^{(R)} \to \mathbb{R}$ be $a_1$-Lipschitz with respect to the infinity norm. Assume a loss function $\ell : \{-1, +1\} \times \mathbb{R} \to [0, 1]$, which is $a_2$-Lipschitz in its second argument. Further assume that $\ell$ has a full range. Suppose that there are $C, \beta, K \ge 0$ such that for all $m \ge K$,

$$\mathcal{R}_m^{\sup}(\mathcal{H}) \le \frac{C \ln^\beta(m)}{\sqrt{m}}.$$

Then there exists a number $N \ge 0$ that depends only on $C$, $\beta$ and $K$ such that for any distribution $D$ with average bag size $r$, and for all $m \ge 1$,

$$\mathcal{R}_m(\overline{\mathcal{H}}_\ell, D) \le \frac{4 + 10 \log(4e a_1^2 a_2^2 B^2 r m^2) \left(N + \frac{a_1 a_2 C}{\beta + 1} \ln^{\beta + 1}(16 a_1^2 a_2^2 m)\right)}{\sqrt{m}}.$$



Proof Let S* be a labeled bag sample of size m, and let r be its average bag size. Denote T{x) — C In^ [x], 

and define 7(7) — 2 ■ We will show that Ti-^^F-^ < 7, thus allowing the use of Lemma[T9l We have 

Km < T{m)/.Jm, thus it suffices to show that T(/(7))/V/(7) - 7- 

Let z(7) = \/f{l)/T{f {-/)). We will now show that z{-/)T{z^{-f)) > ^T{l/j^). Since the function 

xT{x^) = Cx \\r [x'^) is monotonic increasing for a; > 1, we will conclude that 2(7) > I/7 for all 7 < 1. 
It is easy to see that for all values of /3, C > 0, there is a number n > such that for all x > n, 

CHn^P{x)<x'-^'"". 
For such X we have 

T{x/T\x)) = Cln^(^j^^^) = C{Hx) - HC^ ln^^(a:)))'^ 

> C{\n{x) - (1 - 2-1/'^) ln(a;)))'^ = C\n^{x)/2 ^ T{x)/2. (22) 

Let 7o > such that /(70) — k — max{n, K}. Since /{j) is monotonic decreasing with 7, for all 7 < 70, 
fil) ^ k. Therefore, for 7 < 70, 



V7(7)^, /(7) ,^1 V7(7)^,» ,,_1 



<l)nz\,)) = ^^fij4^)) > -2m^f(f(^^^ = 2 V7M = T(l/7^)/7. 

The middle inequality follows from Eq. (l22l l. and the last equality follows from the definition of f{'j). We 
conclude that 2(7) > -. Therefore, for all 7 < 70, 



7 
Define / as follows:



/(7) = 



fil) 7 < 7o 
k 7 > 7o. 



For 7 < 7o, clearly TZ'^^^ {H) < 7, and for 7 > 70, 



fh 



nj^^in) = 7^^(7o)(H) = nj^^jn) < 70 < 7. 



Therefore for all 7 e (0, 1], '^'^Ts ("H) < 7. By Lemina[T9l we have 






10 , f 'ieaialB'^fm 
4e + ^ log ^-\ 



4aia27o 



4aia27o 



1+ / V^d7+ / ' J/(t^^)««7 



4aia2' 



< 



10 , ( AealalB'^fm\ / /- f 



4aia27o 



^(i^)^- 



(23) 



Denote A^ = 1 + \/fc. Now, if /3 > we have 

(■4aia27o 



4aia2 J<: V 4aia2 Je 7 



2aia2C / — 2aia2C 

Je 7 

aia2C / /3+i. l6aiai , 
^+1 \ ^ £2 ^ 



ln^+^(^)/(2(/3 + l)) 



The same inequahty holds also for /3 ~ 0, since in that case 

f4aia27, 



/ 



/(-^)d7 = 2aia2 / ^ ^^^^ ^7 



2aia2C 



4aia2 

4aia27o 1 



7 



- = 2aia2C[ln(7)]f ^"^^^^ = 2aia,Cln( ^°^°'^° ) 

7 e 






>C_ A ;9+i^l6a|a|A 



Therefore we can further bound Eq. ( |23] ) to get 

7^(H., ^) < 4. + ii log f 4ea^ii?2f.^ ^ ( ^ ^ -_p^ i^/.+i(^i-2 
Vm \ e"^ J \ P + 1 e 



01026* ^+1 16a|a| 



Setting e = 1/^/m we get 



n{ni,s)< 



A+10\og{AealalB^Pm^) (n + 2^^lTi>^+\l6alalm)\ 
Now, for a given sample $S$ denote its average bag size by $r(S)$. We have




In the last inequality we used Jensen's inequality and the fact that $\mathbb{E}_{S \sim D^m}[r(S)] = r$. This is the desired bound, hence the theorem is proven. ■

To demonstrate the implications of this theorem, consider the case of MIL with soft-margin kernel 
SVM. Kernel S VM can operate in a general Hilbert space, which we denote by T. The domain of instances 
is X = {x E T \ \\x\\ < 1}, and the function class is the class of linear separators with a bounded norm 
y^{C) — {hyj \ w & T, \\w\\ < C}, for some C > 2, where h^, = ( x, w). The loss is the hinge-lo ss £^1 
defined above, which is 1-Lipschitz in the second argument. We have (IBartlett and MendelsonL l2002h 



^r(mckj<^-^^^"(™) 



m \ m 



Thus we can apply Theorem 20 with $\beta = 0$. Note that $\mathcal{W}(C) \subseteq [-C, C]^{\mathcal{X}}$, thus we can apply the theorem with $B = 2C$ by simply shifting the output of each $h_w$ by $C$ and adjusting the loss function accordingly. By Theorem 20 there exists a number $N$ such that for any 1-Lipschitz bag-function $\psi$ (such as max) and for any distribution $D$ over labeled bags with an average bag size of $r$, we have

4:+10log{16eB^rm.^){N + Cln{16m)) 

'ri-m\Til,D) < 



/TO 

We can use this result and apply Eq. (13) to get an upper bound on the loss of MIL with soft-margin SVM.

7. PAC-Learning for MIL 

In the previous sections we addressed the sample complexity of generalized MIL, showing that it grows 
only logarithmically with the bag size. We now turn to consider the computational aspect of MIL, and 
specifically the relationship between computational feasibility of MIL and computational feasibility of the 
learning problem for the underlying instance hypothesis. 

We consider real-valued hypothesis classes $\mathcal{H} \subseteq [-1, +1]^{\mathcal{X}}$, and provide a MIL algorithm which uses a learning algorithm that operates on single instances as an oracle. We show that if the oracle can minimize error with respect to $\mathcal{H}$, and the bag-function satisfies certain boundedness conditions, then the MIL algorithm is guaranteed to PAC-learn $\overline{\mathcal{H}}$. In particular, the guarantees hold if the bag-function is Boolean OR or max, as in classical MIL and its extension to real-valued hypotheses.

Given an algorithm $\mathcal{A}$ that learns $\mathcal{H}$ from single instances, we provide an algorithm called MILearn that uses $\mathcal{A}$ to implement a weak learner for bags with respect to $\overline{\mathcal{H}}$. That is, for any weighted sample of bags, MILearn returns a hypothesis from $\overline{\mathcal{H}}$ that has some success in labeling the bag-sample correctly. This will allow the use of MILearn as the building block in a Boosting algorithm (Freund and Schapire, 1997), which will find a linear combination of hypotheses from $\overline{\mathcal{H}}$ that classifies unseen bags with high accuracy. Furthermore, if $\mathcal{A}$ is efficient then the resulting Boosting algorithm is also efficient, with a polynomial dependence on the maximal bag size.

We open with background on Boosting in Section 7.1. We then describe the weak learner and analyze its properties in Section 7.2. In Section 7.3 we provide guarantees on a Boosting algorithm that uses our weak learner, and conclude that the computational complexity of PAC-learning for MIL can be bounded by the computational complexity of agnostic PAC-learning for single instances.

7.1 Background: Boosting with Margin Guarantees 

In this section we give some background on Boosting algorithms, which we will use to derive an efficient learning algorithm for MIL. Boosting methods (Freund and Schapire, 1997) are techniques that allow enhancing the power of a weak learner (a learning algorithm that achieves error slightly better than chance) to derive a classification rule that has low error on an input sample. The idea is to iteratively execute the weak learner on weighted versions of the input sample, and then to return a linear combination of the classifiers that were emitted by the weak learner in each round.

Let $A$ be a domain of objects to classify, and let $H \subseteq [-1, +1]^A$ be the hypothesis class used by the weak learner. A Boosting algorithm receives as input a labeled sample $S = \{(x_i, y_i)\}_{i=1}^m \subseteq A \times \{-1, +1\}$, and iteratively feeds to the weak learner a reweighed version of $S$. Denote the $m$-dimensional simplex by $\Delta_m = \{w \in \mathbb{R}^m \mid \sum_{i \in [m]} w[i] = 1,\ \forall i \in [m],\ w[i] \ge 0\}$. For a vector $w \in \Delta_m$, $S_w = \{(w[i], x_i, y_i)\}_{i=1}^m$ is the sample $S$ reweighed by $w$. The Boosting algorithm runs in $k$ rounds. On round $t$ it sets a weight vector $w_t \in \Delta_m$, calls the weak learner with input $S_{w_t}$, and receives a hypothesis $h_t \in H$ as output from the weak learner. After $k$ rounds, the Boosting algorithm returns a classifier $f_o : A \to [-1, +1]$, which is a linear combination of the hypotheses received from the weak learner: $f_o = \sum_{t \in [k]} \alpha_t h_t$, where $\alpha_1, \dots, \alpha_k \in \mathbb{R}$.

The literature offers plenty of Boosting algorithms with desirable properties. For concreteness, we use the algorithm AdaBoost* (Rätsch and Warmuth, 2005), since it provides suitable guarantees on the margin of its output classifier. For a labeled example $(x, y)$, the quantity $y f_o(x)$ is the margin of $f_o$ when classifying $x$. If the margin is positive, then $\mathrm{sign} \circ f_o$ classifies $x$ correctly. The margin of any function $f$ on a labeled sample $S = \{(x_i, y_i)\}_{i=1}^m$ is defined as

$$M(f, S) \triangleq \min_{i \in [m]} y_i f(x_i).$$

If $M(f, S)$ is positive, then the entire sample is classified correctly by $\mathrm{sign} \circ f$.
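As a small illustration of the margin of a combined classifier, the sketch below evaluates $f_o = \sum_t \alpha_t h_t$ on a toy sample and computes $M(f_o, S)$. The function name `sample_margin` and all numbers are made up for illustration; they are not the output of AdaBoost* or of any weak learner.

```python
import numpy as np

def sample_margin(f_values, labels):
    """M(f, S) = min_i y_i f(x_i), given the values f(x_i) and the labels y_i."""
    return float(np.min(np.asarray(labels, dtype=float) * np.asarray(f_values, dtype=float)))

labels = np.array([1.0, -1.0, 1.0])
# Row t holds the outputs of a hypothetical weak hypothesis h_t on the three examples.
h_outputs = np.array([[0.8, -0.6, 0.2],
                      [0.4, -0.2, -0.1]])
alphas = np.array([0.5, 0.5])
f_o = alphas @ h_outputs                   # linear combination of the weak hypotheses
print(sample_margin(f_o, labels))
```

Here the second hypothesis misclassifies the third example on its own, yet the combination keeps a small positive margin on the whole sample, so $\mathrm{sign} \circ f_o$ classifies every example correctly.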

If $S$ is an i.i.d. sample drawn from a distribution on $A \times \{-1, +1\}$, then the classification error of $f_o$ on the distribution can be bounded based on $M(f_o, S)$ and the pseudo-dimension $d$ of the hypothesis class $H$. The following bound (Schapire and Singer, 1999, Theorem 8) holds with probability $1 - \delta$ over the training samples, for any $m \ge d$:

$$\mathbb{P}[Y f_o(X) \le 0] \le O\left(\sqrt{\frac{d \ln^2(m/d)/M^2(f_o, S) + \ln(1/\delta)}{m}}\right). \tag{24}$$

In fact, inspection of the proof of this bound in Schapire and Singer (1999) reveals that the only property of the hypothesis class $H$ that is used to achieve this result is the following bound, due to Haussler and Long (1995), on the covering number of a hypothesis class $H$ with pseudo-dimension $d$:



V7G(0,1], M„(7,W,oo)< (— j . (25) 

Thus, Eq. (24) holds whenever this covering bound holds, a fact that will be useful to us.

For AdaBoost*, a guarantee on the size of the margin of $f_o$ can be achieved if one can provide a guarantee on the edge of the hypotheses returned by the weak learner. The edge of a hypothesis measures how successful it is in classifying labeled examples. Let $h : A \to [-1, +1]$ be a hypothesis and let $D$ be a distribution over $A \times \{-1, +1\}$. The edge of $h$ with respect to $D$ is

$$\Gamma(h, D) \triangleq \mathbb{E}_{(X, Y) \sim D}[Y \cdot h(X)].$$

For a weighted and labeled sample $S = \{(w_i, x_i, y_i)\}_{i \in [m]} \subseteq \mathbb{R}_+ \times A \times \{-1, +1\}$,

$$\Gamma(h, S) \triangleq \sum_{i \in [m]} w_i y_i h(x_i).$$

Note that if $h(x)$ is interpreted as the probability of $h$ to emit 1 for input $x$, then $\frac{1}{2} - \frac{\Gamma(h, D)}{2}$ is the expected misclassification error of $h$ on $D$. Thus, a positive edge implies a labeling success of more than chance. For AdaBoost*, a positive edge on each of the weighted samples fed to the weak learner suffices to guarantee a positive margin of its output classifier $f_o$.
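The relation between the edge and the misclassification error can be seen on a tiny weighted sample. In the sketch below the function name `edge` and the data are hypothetical, and the hypothesis is $\{-1, +1\}$-valued, so $\frac{1}{2} - \frac{\Gamma}{2}$ coincides with the weighted zero-one error.

```python
import numpy as np

def edge(weights, labels, h_values):
    """Edge of a hypothesis on a weighted labeled sample: sum_i w_i * y_i * h(x_i)."""
    w = np.asarray(weights, dtype=float)
    y = np.asarray(labels, dtype=float)
    h = np.asarray(h_values, dtype=float)
    return float(np.sum(w * y * h))

weights = np.full(4, 0.25)                      # a point in the simplex
labels = np.array([1.0, 1.0, -1.0, -1.0])
h_values = np.array([1.0, 1.0, -1.0, 1.0])      # wrong only on the last example
gamma_edge = edge(weights, labels, h_values)
weighted_error = float(np.sum(weights * (labels != h_values)))
print(gamma_edge, 0.5 - gamma_edge / 2.0, weighted_error)
```

An edge of zero corresponds to chance-level labeling, and any positive edge means the weighted error is strictly below one half.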



Theorem 21 (Rätsch and Warmuth, 2005) Assume AdaBoost* receives a labeled sample $S$ of size $m$ as input. Suppose that AdaBoost* runs for $k$ rounds and returns the classifier $f_o$. If for every round $t \in [k]$, $\Gamma(h_t, S_{w_t}) \ge \rho$, then $M(f_o, S) \ge \rho - \sqrt{2 \ln m / k}$.
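The slack term $\sqrt{2 \ln m / k}$ in Theorem 21 determines how many rounds are needed for the margin to come within a given distance of the edge guarantee $\rho$. The following sketch solves $\sqrt{2 \ln m / k} \le \epsilon$ for $k$; the function names `margin_slack` and `rounds_for_slack` are hypothetical and the computation is plain arithmetic on the stated bound.

```python
import math

def margin_slack(m, k):
    """Slack sqrt(2 ln(m) / k) between the edge guarantee rho and the margin bound."""
    return math.sqrt(2.0 * math.log(m) / k)

def rounds_for_slack(m, eps):
    """Smallest number of rounds k with sqrt(2 ln(m) / k) <= eps."""
    return math.ceil(2.0 * math.log(m) / eps ** 2)

k = rounds_for_slack(1000, 0.1)
print(k, margin_slack(1000, k))
```

The required number of rounds grows only logarithmically with the sample size $m$, but quadratically in $1/\epsilon$.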

We present a simple corollary, which we will use when analyzing Boosting for MIL. This corollary shows that AdaBoost* can be used to transform a weak learner that approximates the best edge of a weighted sample to a Boosting algorithm that approximates the best margin of a labeled sample. The proof of the corollary employs the following well known result, originally by von Neumann (1928) and later extended (see e.g. Nash and Sofer, 1996). For a hypothesis class $H$, denote by $\mathrm{co}(H)$ the set of all linear combinations of hypotheses in $H$. We say that $H \subseteq [-1, +1]^A$ is compact with respect to a sample $S = \{(x_i, y_i)\}_{i \in [m]} \subseteq A \times \{-1, +1\}$ if the set of vectors $\{(h(x_1), \dots, h(x_m)) \mid h \in H\}$ is compact.

Theorem 22 (The Strong Min-Max theorem) If $H$ is compact with respect to $S$, then

$$\min_{w \in \Delta_m} \sup_{h \in H} \Gamma(h, S_w) = \sup_{f \in \mathrm{co}(H)} M(f, S).$$

Corollary 23 Suppose that AdaBoost* is executed with an input sample $S$, and assume that $H$ is compact with respect to $S$. Assume the weak learner used by AdaBoost* has the following guarantee: For any $w \in \Delta_m$, if the weak learner receives $S_w$ as input, then with probability at least $1 - \delta$ it returns a hypothesis $h_o$ such that

$$\Gamma(h_o, S_w) \geq g\Big(\sup_{h \in H} \Gamma(h, S_w)\Big),$$

where $g : [-1, +1] \to [-1, +1]$ is some fixed non-decreasing function. Then for any input sample $S$, if AdaBoost* runs $k$ rounds, it returns a linear combination of hypotheses $f_o = \sum_{t \in [k]} \alpha_t h_t$, such that with probability at least $1 - k\delta$

$$M(f_o, S) \geq g\Big(\sup_{f \in \mathrm{co}(H)} M(f, S)\Big) - \sqrt{2 \ln m / k}.$$

Proof By Theorem 22, $\min_{w \in \Delta_m} \sup_{h \in H} \Gamma(h, S_w) = \sup_{f \in \mathrm{co}(H)} M(f, S)$. Thus, for any vector of weights $w$ in the simplex, $\sup_{h \in H} \Gamma(h, S_w) \geq \sup_{f \in \mathrm{co}(H)} M(f, S)$. It follows that in each round, the weak learner that receives $S_{w_t}$ as input returns a hypothesis $h_t$ such that $\Gamma(h_t, S_{w_t}) \geq g(\sup_{h \in H} \Gamma(h, S_{w_t})) \geq g(\sup_{f \in \mathrm{co}(H)} M(f, S))$. By Theorem 21 it follows that $M(f_o, S) \geq g(\sup_{f \in \mathrm{co}(H)} M(f, S)) - \sqrt{2 \ln m / k}$. ∎
7.2 The Weak Learner

In this section we will present our weak learner for MIL and provide guarantees for the edge it achieves. Our guarantees depend on boundedness properties of the bag-function $\psi$, which we define below. To motivate our definition of boundedness, consider the p-norm bag functions (see Def. 3), defined by

$$\psi_p(\mathbf{z}) \triangleq \Big(\frac{1}{n} \sum_{i=1}^{n} (z[i] + 1)^p\Big)^{1/p} - 1.$$

Recall that this class of functions includes the max function ($\psi_\infty$) and the average function ($\psi_1$) as two extremes. Assume $R \subseteq [r]$ for some $r \in \mathbb{N}$. It is easy to verify that for any natural $n$, any sequence $z_1, \ldots, z_n \in [-1, +1]$, and all $p \in [1, \infty]$,

$$\frac{1}{n} \sum_{i \in [n]} z_i \leq \psi_p(z_1, \ldots, z_n) \leq \sum_{i \in [n]} z_i + n - 1.$$

Since $R \subseteq [r]$, it follows that for all $(z_1, \ldots, z_n) \in [-1, +1]^{(R)}$,

$$\frac{1}{r} \sum_{i \in [n]} z_i \leq \psi_p(z_1, \ldots, z_n) \leq \sum_{i \in [n]} z_i + r - 1. \qquad (26)$$

We will show that in cases where the bag function is linearly bounded in the sum of its arguments, as in Eq. (26), a single-instance learning algorithm can be used to learn MIL. Our weak learner will be parameterized by the boundedness parameters of the bag-function, defined formally as follows.

Definition 24 A function $\psi : [-1, +1]^{(R)} \to [-1, +1]$ is $(a, b, c, d)$-bounded if for all $(z_1, \ldots, z_n) \in [-1, +1]^{(R)}$,

$$a \sum_{i \in [n]} z_i + b \leq \psi(z_1, \ldots, z_n) \leq c \sum_{i \in [n]} z_i + d.$$

Thus, for all $p \in [1, \infty)$, $\psi_p$ over bags of size at most $r$ is $(\frac{1}{r}, 0, 1, r - 1)$-bounded.
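The p-norm bag functions and the per-bag inequality that motivates the boundedness definition can be checked numerically. The following Python sketch is only an illustration: the names `psi_p` and `within_linear_bounds` are hypothetical, and the check uses the bounds stated for a bag of size $n$, namely $\frac{1}{n}\sum_i z_i \leq \psi_p(\mathbf{z}) \leq \sum_i z_i + n - 1$.

```python
import math
from typing import Sequence


def psi_p(z: Sequence[float], p: float) -> float:
    """p-norm bag function on values in [-1, +1]; p = math.inf gives the max."""
    if math.isinf(p):
        return max(z)
    n = len(z)
    return (sum((zi + 1.0) ** p for zi in z) / n) ** (1.0 / p) - 1.0


def within_linear_bounds(z: Sequence[float], p: float, tol: float = 1e-9) -> bool:
    """Check (1/n) * sum(z) <= psi_p(z) <= sum(z) + n - 1 for a single bag."""
    n, total = len(z), sum(z)
    value = psi_p(z, p)
    return total / n - tol <= value <= total + n - 1 + tol


z = [-1.0, 0.5, 1.0]
values = [psi_p(z, p) for p in (1, 2, 5, math.inf)]
```

For this bag the values increase with $p$, from the average at $p = 1$ to the maximum at $p = \infty$.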

Before listing the weak learner MILearn, we introduce some notations. $h_{\mathrm{pos}}$ denotes a special bag-hypothesis that labels all bags as $+1$: $\forall \bar{\mathbf{x}} \in \mathcal{X}^{(R)}$, $h_{\mathrm{pos}}(\bar{\mathbf{x}}) = 1$. We denote $\overline{\mathcal{H}}_+ = \overline{\mathcal{H}} \cup \{h_{\mathrm{pos}}\}$. Let $\mathcal{A}$ be an algorithm that receives a labeled and weighted instance sample as input, and returns a hypothesis $h \in \mathcal{H}$. The result of running $\mathcal{A}$ with input $S$ is denoted $\mathcal{A}(S) \in \mathcal{H}$.

The algorithm MILearn, listed as Algorithm 1 below, accepts as input a bag sample $S$ and a bounded bag-function $\psi$. It also has access to the algorithm $\mathcal{A}$. We sometimes emphasize that MILearn uses a specific algorithm $\mathcal{A}$ as an oracle by writing MILearn$^{\mathcal{A}}$. MILearn constructs a sample of instances $S_I$ from the instances that make up the bags in $S$, labeling each instance in $S_I$ with the label of the bag it came from. The weights of the instances depend on whether the bag they came from was positive or negative, and on the boundedness properties of $\psi$. Having constructed $S_I$, MILearn calls $\mathcal{A}$ with $S_I$. It then decides whether to return the bag-hypothesis induced by applying $\psi$ to $\mathcal{A}(S_I)$, or to simply return $h_{\mathrm{pos}}$.

It is easy to see that the time complexity of MILearn is bounded by $O(f(N) + N)$, where $N$ is the total number of instances in the bags of $S$, and $f(n)$ is an upper bound on the time complexity of $\mathcal{A}$ when running on a sample of size $n$. As we presently show, the output of MILearn is a bag-hypothesis in $\overline{\mathcal{H}}_+$ whose edge on $S$ depends on the best achievable edge for $S$.

The guarantees for MILearn$^{\mathcal{A}}$ depend on the properties of $\mathcal{A}$. We define two properties that we consider for $\mathcal{A}$. The first property is that the edge of the hypothesis $\mathcal{A}$ returns is close to the best possible one on the input sample.

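Definition 25 ($\epsilon$-optimal) An algorithm $\mathcal{A}$ that accepts a weighted and labeled sample of instances in $\mathcal{X}$ and returns a hypothesis in $\mathcal{H}$ is $\epsilon$-optimal if for all weighted samples $S \subseteq \mathbb{R}_+ \times \mathcal{X} \times \{-1, +1\}$ with total weight $W$,

$$\Gamma(\mathcal{A}(S), S) \geq \sup_{h \in \mathcal{H}} \Gamma(h, S) - \epsilon W.$$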
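Algorithm 1: MILearn$^{\mathcal{A}}$

Assumptions:

• $\mathcal{H} \subseteq [-1, +1]^{\mathcal{X}}$.

• Algorithm $\mathcal{A}$ receives a weighted instance sample and returns a hypothesis in $\mathcal{H}$.

Input:

• $S = \{(w_i, \bar{\mathbf{x}}_i, y_i)\}_{i \in [m]}$: a labeled and weighted sample of bags,

• $\psi$: an $(a, b, c, d)$-bounded bag-function.

Output: $h_o \in \overline{\mathcal{H}}_+$.

1 $\alpha(+1) \leftarrow a$, $\alpha(-1) \leftarrow c$.

2 $S_I \leftarrow \{(\alpha(y_i) \cdot w_i, x_i[j], y_i)\}_{i \in [m], j \in [r]}$.

3 $h_I \leftarrow \mathcal{A}(S_I)$.

4 if $\Gamma(\bar{h}_I, S) \geq \Gamma(h_{\mathrm{pos}}, S)$ then

5     $h_o \leftarrow \bar{h}_I$,

6 else

7     $h_o \leftarrow h_{\mathrm{pos}}$.

8 Return $h_o$.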
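The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: instances are plain floats, the bag function is passed as a callable on a list of instance predictions, step 2 simply iterates over the instances actually present in each bag, and `threshold_oracle` is a hypothetical stand-in for the oracle that searches one-dimensional threshold rules.

```python
from typing import Callable, Sequence, Tuple

Bag = Sequence[float]                        # a bag is a sequence of instances
WeightedBag = Tuple[float, Bag, int]         # (w_i, bag_i, y_i)
WeightedInstance = Tuple[float, float, int]  # (weight, instance, label)


def bag_edge(h_bar: Callable[[Bag], float], bags: Sequence[WeightedBag]) -> float:
    """Edge of a bag hypothesis on a weighted bag sample."""
    return sum(w * y * h_bar(bag) for w, bag, y in bags)


def milearn(bags, psi, a, c, oracle):
    """Sketch of MILearn for an (a, b, c, d)-bounded bag function psi."""
    alpha = {+1: a, -1: c}                                                   # step 1
    instances = [(alpha[y] * w, x, y) for w, bag, y in bags for x in bag]   # step 2
    h_inst = oracle(instances)                                               # step 3

    def h_bar(bag):
        # Bag hypothesis induced by applying psi to the instance predictions.
        return psi([h_inst(x) for x in bag])

    def h_pos(bag):
        # Labels every bag as +1.
        return 1.0

    # Steps 4-8: keep whichever of the two has the larger edge on the bag sample.
    return h_bar if bag_edge(h_bar, bags) >= bag_edge(h_pos, bags) else h_pos


def threshold_oracle(instances: Sequence[WeightedInstance]):
    """Hypothetical oracle: best rule h(x) = +1 if x >= t else -1 over observed t."""
    def rule(t):
        return lambda x: 1.0 if x >= t else -1.0
    candidates = [rule(x) for _, x, _ in instances]
    return max(candidates, key=lambda h: sum(w * y * h(x) for w, x, y in instances))


# Two bags of size r = 2 with psi = max, which is (1/r, 0, 1, r - 1)-bounded.
bags = [(0.5, [0.1, 0.9], +1), (0.5, [0.2, 0.3], -1)]
h_o = milearn(bags, psi=max, a=0.5, c=1.0, oracle=threshold_oracle)
```

In this toy run the oracle picks the threshold at 0.9, and the induced bag hypothesis labels both bags correctly, so it is preferred over the all-positive hypothesis.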



The second property is that the edge of the hypothesis that $\mathcal{A}$ returns is close to the best possible one on the input sample, but only compared to the edges that can be achieved by hypotheses that label all the negative instances of $S$ with $-1$. For a hypothesis class $\mathcal{H}$ and a distribution $D$ over labeled examples, we denote the set of hypotheses in $\mathcal{H}$ that label all negative examples in $D$ with $-1$, by

$$\Omega(\mathcal{H}, D) \triangleq \{h \in \mathcal{H} \mid \mathbb{P}_{(X,Y) \sim D}[h(X) = -1 \mid Y = -1] = 1\}.$$

For a labeled sample $S$, $\Omega(\mathcal{H}, S) \triangleq \Omega(\mathcal{H}, U_S)$ where $U_S$ is the uniform distribution over the examples in $S$.

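Definition 26 (one-sided-$\epsilon$-optimal) An algorithm $\mathcal{A}$ that accepts a weighted and labeled sample of instances in $\mathcal{X}$ and returns a hypothesis in $\mathcal{H}$ is one-sided-$\epsilon$-optimal if for all weighted samples $S \subseteq \mathbb{R}_+ \times \mathcal{X} \times \{-1, +1\}$ with total weight $W$,

$$\Gamma(\mathcal{A}(S), S) \geq \sup_{h \in \Omega(\mathcal{H}, S)} \Gamma(h, S) - \epsilon W.$$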

Clearly, any algorithm which is $\epsilon$-optimal is also one-sided-$\epsilon$-optimal, thus the first requirement from $\mathcal{A}$ is stronger. In our results below we compare the edge achieved using MILearn to the best possible edge for the sample $S$. Denote the best edge achievable for $S$ by a hypothesis in $\overline{\mathcal{H}}$ by

$$\gamma^* \triangleq \sup_{h \in \mathcal{H}} \Gamma(\bar{h}, S).$$

We denote by $\gamma^*_+$ the best edge that can be achieved by a hypothesis in $\Omega(\overline{\mathcal{H}}, S)$. Formally,

$$\gamma^*_+ \triangleq \sup_{\bar{h} \in \Omega(\overline{\mathcal{H}}, S)} \Gamma(\bar{h}, S).$$

Denote the weight of the positive bags in the input sample $S$ by $W_+ = \sum_{i : y_i = +1} w_i$ and the weight of the negative bags by $W_- = \sum_{i : y_i = -1} w_i$. We will henceforth assume without loss of generality that the total weight of all bags in the input sample is 1, that is $W_+ + W_- = 1$.

Note that for any $(a, b, c, d)$-bounded $\psi$, if there exists any sequence $z_1, \ldots, z_n$ such that $\psi(z_1, \ldots, z_n) = -1$, then

$$a \sum_{i \in [n]} z_i + b \leq -1 \leq c \sum_{i \in [n]} z_i + d. \qquad (27)$$

This implies

$$\frac{-1 - d}{c} \leq \sum_{i \in [n]} z_i \leq \frac{-1 - b}{a}.$$

Rearranging, we get $d - \frac{c}{a} b - \frac{c}{a} + 1 \geq 0$, with equality if Eq. (27) holds with equalities. The next theorem provides a guarantee for MILearn that depends on the tightness of this inequality for the given bag function. As evident from Theorem 21, to guarantee a positive margin for the output of AdaBoost* when used with MILearn as the weak learner, we need to guarantee that the edge of the hypothesis returned by MILearn is always positive. Since the best edge cannot be more than 1, we emphasize in the theorem below that the edge achieved by MILearn is positive at least when the best edge is 1 (and possibly also for smaller edges, depending on the parameters). We subsequently show how these general guarantees translate to a specific result for the max function, and other bag functions with the same boundedness properties.

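Theorem 27 Let $r \in \mathbb{N}$ and $R \subseteq [r]$. Let $\psi : [-1, +1]^{(R)} \to [-1, +1]$ be an $(a, b, c, d)$-bounded bag-function such that $0 < a \leq c$. Let $\epsilon \in [0, \frac{1}{rc})$, and assume that $d - \frac{c}{a} b - \frac{c}{a} + 1 = \eta$. Denote $Z = \frac{c}{a}$. Consider running the algorithm MILearn$^{\mathcal{A}}$ with a weighted bag sample $S$ of total weight 1, and let $h_o$ be the hypothesis returned by MILearn$^{\mathcal{A}}$. Then

1. If $\mathcal{A}$ is $\epsilon$-optimal then

$$\Gamma(h_o, S) \geq \frac{Z\gamma^* - Z + \frac{1}{Z} - \frac{\eta}{2}\big(1 + \frac{1}{Z}\big) - rc\epsilon}{1 + \big(1 - \frac{\eta}{2}\big)\big(1 - \frac{1}{Z}\big)}.$$

Thus, $\Gamma(h_o, S) > 0$ whenever

$$\gamma^* > 1 - \frac{1}{Z^2} + \frac{\eta}{2}\Big(\frac{1}{Z} + \frac{1}{Z^2}\Big) + \frac{rc\epsilon}{Z}.$$

In particular, if $\eta < 2(1 - rc\epsilon)/(Z + 1)$ and $\gamma^* = 1$ then $\Gamma(h_o, S) > 0$.

2. If $\mathcal{A}$ is one-sided-$\epsilon$-optimal, and $\psi(z_1, \ldots, z_n) = -1$ only if $z_1 = \ldots = z_n = -1$, then

$$\Gamma(h_o, S) \geq \frac{\gamma^*_+ - \frac{\eta}{2}(Z + 1) - rc\epsilon Z}{2Z - 1 - \frac{\eta}{2}(Z - 1)}.$$

Thus, $\Gamma(h_o, S) > 0$ whenever

$$\gamma^*_+ > \frac{\eta}{2}(Z + 1) + rc\epsilon Z.$$

In particular, if $\eta < 2(1 - rc\epsilon Z)/(Z + 1)$ and $\gamma^*_+ = 1$ then $\Gamma(h_o, S) > 0$.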
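To see how the two guarantees of Theorem 27 behave for concrete parameters, the following Python sketch evaluates both lower bounds as they are obtained at the end of the proof in Appendix A. The function names and argument order are illustrative assumptions; `Z` stands for $c/a$ and `eta` for $\eta$.

```python
def edge_bound_optimal(gamma_star: float, Z: float, eta: float,
                       r: int, c: float, eps: float) -> float:
    """Lower bound on the edge of MILearn's output when the oracle is eps-optimal."""
    numerator = (Z * gamma_star - Z + 1.0 / Z
                 - (eta / 2.0) * (1.0 + 1.0 / Z) - r * c * eps)
    denominator = 1.0 + (1.0 - eta / 2.0) * (1.0 - 1.0 / Z)
    return numerator / denominator


def edge_bound_one_sided(gamma_plus: float, Z: float, eta: float,
                         r: int, c: float, eps: float) -> float:
    """Lower bound on the edge when the oracle is one-sided-eps-optimal."""
    numerator = gamma_plus - (eta / 2.0) * (Z + 1.0) - r * c * eps * Z
    denominator = 2.0 * Z - 1.0 - (eta / 2.0) * (Z - 1.0)
    return numerator / denominator


# A max-like bag function over bags of size at most r = 4 is (1/r, 0, 1, r - 1)-bounded,
# so Z = r = 4, c = 1 and eta = d - Z*b - Z + 1 = 0.
r = 4
one_sided = edge_bound_one_sided(1.0, Z=float(r), eta=0.0, r=r, c=1.0, eps=1.0 / (2 * r * r))
optimal = edge_bound_optimal(1.0, Z=float(r), eta=0.0, r=r, c=1.0, eps=0.0)
```

With these parameters the one-sided bound reduces to $(\gamma^*_+ - r^2\epsilon)/(2r - 1)$, so a perfectly separable bag sample still leaves a strictly positive edge for the weak learner.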
The proof of the theorem is provided in Appendix A. This theorem is stated in general terms, as it holds for any bounded $\psi$. In particular, if $\psi$ is any function between an average and a max, including any of the p-norm bag functions $\psi_p$ defined in Def. 3, we can simplify the result, as captured by the following corollary.

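Corollary 28 Let $\mathcal{H} \subseteq [-1, +1]^{\mathcal{X}}$. Let $R \subseteq [r]$, and $\epsilon \in [0, \frac{1}{r})$. Assume a bag function $\psi : [-1, +1]^{(R)} \to [-1, +1]$ such that for any $z_1, \ldots, z_n \in [-1, +1]$,

$$\frac{1}{n} \sum_{i \in [n]} z_i \leq \psi(z_1, \ldots, z_n) \leq \max_{i \in [n]} z_i.$$

Let $h_o$ be the hypothesis returned by MILearn$^{\mathcal{A}}$. Then

1. If $\mathcal{A}$ is $\epsilon$-optimal for some $\epsilon \in [0, 1/r]$, then

$$\Gamma(h_o, S) \geq \frac{r^2 \gamma^* - r^2 + 1 - r^2 \epsilon}{2r - 1}.$$

Thus $\Gamma(h_o, S) > 0$ whenever $\gamma^* > 1 - \frac{1}{r^2} + \epsilon$. In particular, if $\gamma^* = 1$ then $\Gamma(h_o, S) > 0$.

2. If $\mathcal{A}$ is one-sided-$\epsilon$-optimal for some $\epsilon \in [0, 1/r^2]$, then

$$\Gamma(h_o, S) \geq \frac{\gamma^*_+ - r^2 \epsilon}{2r - 1}.$$

Thus $\Gamma(h_o, S) > 0$ whenever $\gamma^*_+ > r^2 \epsilon$. In particular, if $\gamma^*_+ = 1$ then $\Gamma(h_o, S) > 0$.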
Proof Let $z_1, \ldots, z_n \in [-1, +1]$. We have

$$\max_{i \in [n]} z_i \leq \sum_{i \in [n]} z_i - (n - 1) \min_{i \in [n]} z_i \leq \sum_{i \in [n]} z_i + n - 1.$$

Therefore, by the assumption on $\psi$, for any $n \in R$

$$\psi(z_1, \ldots, z_n) \leq \sum_{i \in [n]} z_i + n - 1 \leq \sum_{i \in [n]} z_i + r - 1.$$

In addition

$$\frac{1}{r} \sum_{i \in [n]} z_i \leq \frac{1}{n} \sum_{i \in [n]} z_i \leq \psi(z_1, \ldots, z_n).$$

Therefore $\psi$ is $(\frac{1}{r}, 0, 1, r - 1)$-bounded. It follows that $Z = r$ in this case, and $d - Zb - Z + 1 = 0$. Claim (1) follows by applying case (1) of Theorem 27 with $\eta = 0$.

For claim (2) we apply case (2) of Theorem 27. Thus we need to show that if $\psi(z_1, \ldots, z_n) = -1$ and $z_1, \ldots, z_n \in [-1, +1]$, then $z_1 = \ldots = z_n = -1$. We have that

$$-1 \leq \frac{1}{n} \sum_{i \in [n]} z_i \leq \psi(z_1, \ldots, z_n) \leq -1.$$

Therefore $\frac{1}{n} \sum_{i \in [n]} z_i = -1$. Since no $z_i$ can be smaller than $-1$, $z_1 = \ldots = z_n = -1$. Thus case (2) of Theorem 27 holds. We get our claim (2) directly by substituting the boundedness parameters of $\psi$ in Theorem 27, case (2). ∎
7.3 From Single-Instance Learning to Multi-Instance Learning

In this section we combine the guarantees on MILearn with the guarantees on AdaBoost*, to show that efficient agnostic PAC-learning of the underlying instance hypothesis class $\mathcal{H}$ implies efficient PAC-learning of MIL. For simplicity we formalize the results for the natural case where the bag function is $\psi = \max$. Results for other bounded bag functions can be derived in a similar fashion.

First, we formally define the notions of agnostic and one-sided PAC-learning algorithms. We then show that given an algorithm on instances that satisfies one of these definitions, we can construct an algorithm for MIL which approximately maximizes the margin on an input bag sample. Specifically, if the input bag sample is realizable by $\overline{\mathcal{H}}$, then the MIL algorithm we propose will find a linear combination of bag hypotheses that classifies the sample with zero error, and with a positive margin. Combining this with the margin-based generalization guarantees mentioned in Section 7.1, we conclude that we have an efficient PAC-learner for MIL.

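Definition 29 (Agnostic PAC-learner and one-sided PAC-learner) Let $\mathcal{B}(\epsilon, \delta, S)$ be an algorithm that accepts as input $\delta, \epsilon \in (0, 1)$, and a labeled sample $S \in (\mathcal{X} \times \{-1, +1\})^m$, and emits as output a hypothesis $h \in \mathcal{H}$. $\mathcal{B}$ is an agnostic PAC-learner for $\mathcal{H}$ with complexity $c(\epsilon, \delta)$ if $\mathcal{B}$ runs for no more than $c(\epsilon, \delta)$ steps, and for any probability distribution $D$ over $\mathcal{X} \times \{-1, +1\}$, if $S$ is an i.i.d. sample from $D$ of size $c(\epsilon, \delta)$, then with probability at least $1 - \delta$ over $S$ and the randomization of $\mathcal{B}$,

$$\Gamma(\mathcal{B}(\epsilon, \delta, S), D) \geq \sup_{h \in \mathcal{H}} \Gamma(h, D) - \epsilon.$$

$\mathcal{B}$ is a one-sided PAC-learner if under the same conditions, with probability at least $1 - \delta$

$$\Gamma(\mathcal{B}(\epsilon, \delta, S), D) \geq \sup_{h \in \Omega(\mathcal{H}, D)} \Gamma(h, D) - \epsilon.$$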



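Algorithm 2: $\mathcal{O}^{\mathcal{B}}_{\epsilon, \delta}$

Assumptions:

• $\epsilon, \delta \in (0, 1)$.

• $\mathcal{B}$ receives a labeled instance sample as input and returns a hypothesis in $\mathcal{H}$.

• Algorithm $\mathcal{B}$ is a one-sided (or agnostic) PAC-learning algorithm with complexity $c(\epsilon, \delta)$.

Input: A labeled and weighted instance sample $S = \{(w_i, x_i, y_i)\}_{i \in [m]} \subseteq \mathbb{R}_+ \times \mathcal{X} \times \{-1, +1\}$.

Output: A hypothesis in $\mathcal{H}$.

1 For all $i \in [m]$, $p_i \leftarrow w_i / \sum_{i \in [m]} w_i$.

2 For each $t \in [c(\epsilon, \delta)]$, independently draw a random $j_t$ such that $j_t = i$ with probability $p_i$.

3 $\tilde{S} \leftarrow \{(x_{j_t}, y_{j_t})\}_{t \in [c(\epsilon, \delta)]}$.

4 $h \leftarrow \mathcal{B}(\tilde{S})$.

5 Return $h$.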
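The reduction in Algorithm 2 is a weighted resampling wrapper around an unweighted learner. The following Python sketch illustrates it under stated assumptions: `make_oracle`, `learner` and `sample_size` are hypothetical names, `sample_size` plays the role of $c(\epsilon, \delta)$, and the learner is treated as a black box that maps a list of labeled instances to a hypothesis.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

WeightedInstance = Tuple[float, object, int]  # (w_i, x_i, y_i)
LabeledInstance = Tuple[object, int]          # (x, y)


def make_oracle(learner: Callable[[List[LabeledInstance]], object],
                sample_size: int,
                rng: Optional[random.Random] = None):
    """Wrap an unweighted learner so that it accepts a weighted instance sample."""
    rng = rng or random.Random()

    def oracle(weighted: Sequence[WeightedInstance]):
        weights = [w for w, _, _ in weighted]
        # Steps 1-2: random.choices normalizes the weights, so index i is drawn
        # with probability p_i = w_i / sum(w), independently for each draw.
        drawn = rng.choices(range(len(weighted)), weights=weights, k=sample_size)
        # Step 3: build the unweighted labeled sample.
        resampled = [(weighted[j][1], weighted[j][2]) for j in drawn]
        # Steps 4-5: run the learner on the resampled data and return its output.
        return learner(resampled)

    return oracle


# Usage with a trivial "learner" that just echoes the sample it receives.
echo_oracle = make_oracle(lambda s: s, sample_size=5, rng=random.Random(0))
out = echo_oracle([(0.0, "a", -1), (2.0, "b", +1), (0.0, "c", -1)])
```

Because all the weight sits on the second instance in this usage example, every draw returns it, which makes the effect of the weights easy to see.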

Given an agnostic PAC-learner $\mathcal{B}$ for $\mathcal{H}$ and parameters $\epsilon, \delta \in (0, 1)$, the algorithm $\mathcal{O}^{\mathcal{B}}_{\epsilon, \delta}$, listed above as Algorithm 2, is an $\epsilon$-optimal algorithm with probability $1 - \delta$. Similarly, if $\mathcal{B}$ is a one-sided PAC-learner, then $\mathcal{O}^{\mathcal{B}}_{\epsilon, \delta}$ is a one-sided-$\epsilon$-optimal algorithm with probability $1 - \delta$. Our MIL algorithm is then simply AdaBoost* with MILearn$^{\mathcal{O}^{\mathcal{B}}_{\epsilon, \delta}}$ as the (high probability) weak learner. It is easy to see that this algorithm learns a linear combination of hypotheses from $\overline{\mathcal{H}}_+$. We also show below that under certain conditions this linear combination induces a positive margin on the input bag sample with high probability. Given this guaranteed margin, we bound the generalization error of the learning algorithm via Eq. (24).

The computational complexity of $\mathcal{O}^{\mathcal{B}}_{\epsilon, \delta}$ is polynomial in $c(\epsilon, \delta)$ and in the instance-sample size $m$. Therefore, the computational complexity of MILearn$^{\mathcal{O}^{\mathcal{B}}_{\epsilon, \delta}}$ is polynomial in $c(\epsilon, \delta)$ and in $N$, where $N$ is the total number of instances in the input bag sample $S$.

For 1-Lipschitz bag functions which have desired boundedness properties, both the sample complexity 
and the computational complexity of the proposed MIL algorithm are polynomial in the maximal bag size 
and linear in the complexity of the underlying instance hypothesis class. This is formally stated in the 
following theorem, for the case of a realizable distribution over labeled bags. Note that in particular, the 
theorem holds for all the p-norm bag-functions, since they are 1-Lipschitz and satisfy the boundedness 
conditions. 

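Theorem 30 Let $\mathcal{H} \subseteq [-1, +1]^{\mathcal{X}}$ be a hypothesis class with pseudo-dimension $d$. Let $\mathcal{B}$ be a one-sided PAC-learner for $\mathcal{H}$ with complexity $c(\epsilon, \delta)$. Let $r \in \mathbb{N}$, and let $R \subseteq [r]$. Assume that the bag function $\psi : [-1, +1]^{(R)} \to [-1, +1]$ is 1-Lipschitz with respect to the infinity norm, and that for any $(z_1, \ldots, z_n) \in [-1, +1]^{(R)}$

$$\frac{1}{n} \sum_{i \in [n]} z_i \leq \psi(z_1, \ldots, z_n) \leq \max_{i \in [n]} z_i.$$

Assume that $\mathcal{H}$ is compact with respect to any sample of size $m$. Let $D$ be a distribution over $\mathcal{X}^{(R)} \times \{-1, +1\}$ which is realizable by $\overline{\mathcal{H}}$, that is there exists an $h \in \mathcal{H}$ such that $\mathbb{P}_{(X,Y) \sim D}[\bar{h}(X) = Y] = 1$. Assume $m \geq 10 d \ln(er)$, and let $\epsilon = \frac{1}{2r^2}$ and $k = 32(2r - 1)^2 \ln(m)$.

For all $\delta \in (0, 1)$, if AdaBoost* is executed for $k$ rounds on a random sample $S \sim D^m$, with MILearn$^{\mathcal{O}^{\mathcal{B}}_{\epsilon, \delta/2k}}$ as the weak learner, then with probability $1 - \delta$, the classifier $f_o$ returned by AdaBoost* satisfies

$$\mathbb{P}_{(X,Y) \sim D}[Y \cdot f_o(X) \leq 0] \leq O\left(\sqrt{\frac{d r^2 \ln(r) \ln^2(m) + \ln(2/\delta)}{m}}\right). \qquad (28)$$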

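Proof Since $\mathcal{B}$ is a one-sided PAC-learning algorithm, $\mathcal{O}^{\mathcal{B}}_{\epsilon, \delta/2k}$ is one-sided-$\epsilon$-optimal with probability at least $1 - \delta/2k$. Therefore, by case (2) of Cor. 28, if MILearn$^{\mathcal{O}^{\mathcal{B}}_{\epsilon, \delta/2k}}$ receives a weighted bag sample $S_w$, with probability $1 - \delta/2k$ it returns a bag hypothesis $h_o \in \overline{\mathcal{H}}_+$ such that

$$\Gamma(h_o, S_w) \geq \frac{\sup_{\bar{h} \in \Omega(\overline{\mathcal{H}}, S)} \Gamma(\bar{h}, S_w) - r^2 \epsilon}{2r - 1}.$$

Thus, by Cor. 23, if AdaBoost* runs for $k$ rounds then with probability $1 - \delta/2$ it returns a linear combination of hypotheses from $\overline{\mathcal{H}}_+$ such that

$$M(f_o, S) \geq \frac{\sup_{f \in \mathrm{co}(\Omega(\overline{\mathcal{H}}, S))} M(f, S) - r^2 \epsilon}{2r - 1} - \sqrt{\frac{2 \ln m}{k}}. \qquad (29)$$

Due to the realizability assumption for $D$, there is an $\bar{h} \in \Omega(\overline{\mathcal{H}}, S)$ that classifies correctly the bag sample $S$. It follows that for any weighting $w \in \Delta_m$ of $S$, $\Gamma(\bar{h}, S_w) = 1$. It is easy to verify that since $\mathcal{H}$ is compact with respect to $S$, then so is $\Omega(\overline{\mathcal{H}}, S)$. Thus, by Theorem 22, $\sup_{f \in \mathrm{co}(\Omega(\overline{\mathcal{H}}, S))} M(f, S) = \min_{w} \sup_{\bar{h} \in \Omega(\overline{\mathcal{H}}, S)} \Gamma(\bar{h}, S_w) = 1$. Substituting $\epsilon$ and $k$ with their values, setting $\sup_{f \in \mathrm{co}(\Omega(\overline{\mathcal{H}}, S))} M(f, S) = 1$ in Eq. (29) and simplifying, we get that with probability $1 - \delta/2$

$$M(f_o, S) \geq \frac{1}{8r - 4}. \qquad (30)$$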
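The simplification that leads to Eq. (30) is a short computation: with $\epsilon = 1/(2r^2)$ the weak-learner edge for a realizable sample is $(1 - r^2\epsilon)/(2r - 1) = 1/(4r - 2)$, and with $k = 32(2r - 1)^2 \ln(m)$ the slack term $\sqrt{2 \ln m / k}$ equals $1/(8r - 4)$. The following Python sketch checks this numerically; the function name is illustrative, and $k$ is treated as a real number rather than rounded to an integer.

```python
import math


def guaranteed_margin(r: int, m: int) -> float:
    """Margin obtained from Eq. (29) with sup M = 1 and the stated choices of eps and k."""
    eps = 1.0 / (2 * r * r)
    k = 32 * (2 * r - 1) ** 2 * math.log(m)   # number of rounds, kept real-valued here
    weak_edge = (1.0 - r * r * eps) / (2 * r - 1)
    return weak_edge - math.sqrt(2.0 * math.log(m) / k)


margins = {(r, m): guaranteed_margin(r, m) for r in (1, 2, 5) for m in (10, 1000)}
```

The resulting margin does not depend on the sample size $m$, only on the maximal bag size $r$.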
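We would now like to apply the generalization bound in Eq. (24), but for this we need to show that Eq. (25) holds for $\overline{\mathcal{H}}_+$. We have the following bound on the covering numbers of $\overline{\mathcal{H}}$, for all $\gamma \in (0, 1]$:

$$\mathcal{N}_m(\gamma, \overline{\mathcal{H}}, \infty) \leq \mathcal{N}_{rm}(\gamma, \mathcal{H}, \infty) \leq \Big(\frac{erm}{\gamma d}\Big)^d.$$

The first inequality is due to Cor. 13 and the fact that $\psi$ is 1-Lipschitz, and the second inequality is due to Haussler and Long (1995) and the pseudo-dimension of $\mathcal{H}$ (see Eq. (25) above). This implies

$$\mathcal{N}_m(\gamma, \overline{\mathcal{H}}, \infty) \leq \Big(\frac{em}{\gamma \cdot 10 d \ln(er)}\Big)^d e^{d(\ln(10 \ln(er)) + \ln(r))}.$$

Therefore, for $m \geq 10 d \ln(er)$

$$\mathcal{N}_m(\gamma, \overline{\mathcal{H}}_+, \infty) \leq 1 + \mathcal{N}_m(\gamma, \overline{\mathcal{H}}, \infty) \leq 1 + \Big(\frac{em}{\gamma \cdot 10 d \ln(er)}\Big)^d e^{d(\ln(10 \ln(er)) + \ln(r))} \leq \Big(\frac{em}{\gamma \cdot 10 d \ln(er)}\Big)^{d(\ln(10 \ln(er)) + \ln(er))}.$$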

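Now, $\ln(10 \ln(er)) + \ln(er) = \ln(10) + \ln(\ln(er)) + \ln(er) \leq \ln(10) + 2\ln(er) \leq 3 + 2\ln(er) \leq 5\ln(er)$. Therefore,

$$\mathcal{N}_m(\gamma, \overline{\mathcal{H}}_+, \infty) \leq \Big(\frac{em}{\gamma \cdot 10 d \ln(er)}\Big)^{5 d \ln(er)} \leq \Big(\frac{em}{\gamma \cdot 10 d \ln(er)}\Big)^{10 d \ln(er)}.$$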

Thus, for $m \geq 10 d \ln(er)$, Eq. (25) holds for $\overline{\mathcal{H}}_+$ when substituting $d$ with $d_r = 10 d \ln(er)$. This means the generalization bound in Eq. (24) holds when substituting $d$ with $d_r$ as well. It follows that with probability $1 - \delta/2$

$$\mathbb{P}[Y f_o(X) \leq 0] \leq O\left(\sqrt{\frac{d_r \ln^2(m/d_r)/M^2(f_o, S) + \ln(1/\delta)}{m}}\right).$$

Now, with probability $1 - \delta/2$, by Eq. (30) we have $M(f_o, S) \geq 1/(8r - 4)$. Combining the two inequalities and applying the union bound, we have that with probability $1 - \delta$

$$\mathbb{P}[Y f_o(X) \leq 0] \leq O\left(\sqrt{\frac{d_r (8r - 4)^2 \ln^2(m/d_r) + \ln(2/\delta)}{m}}\right) = O\left(\sqrt{\frac{10 d \ln(er)(8r - 4)^2 \ln^2(m) + \ln(2/\delta)}{m}}\right).$$

Due to the O-notation we can simplify the right-hand side to get Eq. (28).

Similar generalization results for Boosting can be derived for margin-learning as well, using covering numbers arguments as discussed in Schapire et al. (1998). The theorem above leads to the following conclusion.
Corollary 31 If there exists a one-sided PAC-learning algorithm for $\mathcal{H}$ with polynomial run-time in $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$, then there exists a PAC-learning algorithm for classical MIL on $\mathcal{H}$, which has polynomial run-time in $r$, $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$.

Cor. 31 is similar in structure to Theorem 1. Both state that if the single-instance problem is solvable with one-sided error, then the realizable MIL problem is solvable. Theorem 1 applies only to bags with statistically independent instances, while Cor. 31 applies to bags drawn from an arbitrary distribution. The assumption of Theorem 1 is similarly weaker, as it only requires that the single-instance PAC-learning algorithm handle random one-sided noise, while Cor. 31 requires that the single-instance algorithm handle arbitrary one-sided noise. Of course, Cor. 31 does not contradict the hardness result provided for APRs in Auer et al. (1998). Indeed, this hardness result states that if there exists a MIL algorithm for $d$-dimensional APRs which is polynomial in both $r$ and $d$, then $\mathcal{RP} = \mathcal{NP}$. Our result does not imply that such an algorithm exists, since there is no known agnostic or one-sided PAC-learning algorithm for APRs which is polynomial in $d$.

We have shown a simple and general way, independent of hypothesis class, to create a PAC-learning algorithm for classical MIL from a learning algorithm that runs on single instances. Whenever an appropriate polynomial algorithm exists for the non-MIL learning problem, the resulting MIL algorithm will also be polynomial in $r$. To illustrate, consider for instance the algorithm proposed in Shalev-Shwartz et al. (2010). This algorithm is an agnostic PAC-learner of fuzzy kernelized half-spaces with an $L$-Lipschitz transfer function, for some constant $L > 0$. Its time complexity and sample-complexity are at most $\mathrm{poly}((\frac{L}{\epsilon})^L \cdot \ln(\frac{1}{\delta}))$. Since this complexity bound is polynomial in $1/\epsilon$ and in $1/\delta$, Cor. 31 applies, and we can generate an algorithm for PAC-learning MIL with complexity that depends directly on the complexity of this learner, and is polynomial in $r$, $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$. More generally, using the construction we proposed here, any advancement in the development of algorithms for agnostic or one-sided learning of any hypothesis class translates immediately to an algorithm for PAC-learning MIL with the same hypothesis class, and with corresponding complexity guarantees.

8. Conclusions 

In this work we have provided a new theoretical analysis for Multiple Instance Learning with any underlying 
hypothesis class. We have shown that the dependence of the sample complexity of generalized MIL on the 
number of instances in a bag is only poly-logarithmic, thus implying that the statistical performance of 
MIL is only mildly sensitive to the size of the bag. The analysis includes binary hypotheses, real-valued 
hypotheses, and margin learning, all of which are used in practice in MIL applications. For classical MIL, 
where the bag-labeling function is the Boolean OR, and for its natural extension to max, we have presented a 
new learning algorithm, that classifies bags by executing a learning algorithm designed for single instances. 
This algorithm provably PAC-learns MIL. In both the sample complexity analysis and the computational analysis, we have shown tight connections between classical supervised learning and Multiple Instance Learning, which hold regardless of the underlying hypothesis class.

Many interesting open problems remain for the generic analysis of MIL. In particular, our results hold 
under certain assumptions on the bag functions. An interesting open question is whether these assump- 
tions are necessary, or whether useful results can be achieved for other classes of bag functions. Another 
interesting question is how additional structure within a bag, such as sparsity, may affect the statistical and 
computational feasibility of MIL. These interesting problems are left for future research. 
Appendix A. Proof of Theorem 27

The first step in providing a guarantee for the edge achieved by MILearn$^{\mathcal{A}}$ is to prove a guarantee for the edge achieved on the bag sample by the hypothesis returned by $\mathcal{A}$ in step (3) of the algorithm. This is done in the following lemma.

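Lemma 32 Assume $\psi : [-1, +1]^{(R)} \to [-1, +1]$ is an $(a, b, c, d)$-bounded bag function with $0 < a \leq c$, and denote $Z = \frac{c}{a}$. Consider running the algorithm MILearn$^{\mathcal{A}}$ with a weighted bag sample $S$ of total weight 1. Let $h_I$ be the hypothesis returned by the oracle $\mathcal{A}$ in step (3) of MILearn$^{\mathcal{A}}$. Let $W$ be the total weight of the sample $S_I$ created in MILearn$^{\mathcal{A}}$, step (2). Then

1. If $\mathcal{A}$ is $\epsilon$-optimal,

$$\Gamma(\bar{h}_I, S) \geq Z\gamma^* + \Big(\frac{1}{Z} - Z + \Big(1 - \frac{1}{Z}\Big)(d - Zb)\Big) W_+ + Zb - d - \epsilon W.$$

2. If $\mathcal{A}$ is one-sided-$\epsilon$-optimal, and $\psi(z_1, \ldots, z_n) = -1$ only if $z_1 = \ldots = z_n = -1$, then

$$\Gamma(\bar{h}_I, S) \geq \frac{1}{Z}\gamma^*_+ + \Big(\frac{1}{Z} - Z + \Big(1 - \frac{1}{Z}\Big)(d - Zb)\Big) W_+ + Zb - d + Z - \frac{1}{Z} - \epsilon W.$$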

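Proof For all $h \in \mathcal{H}$, and for all $\bar{\mathbf{x}} = (x_1, \ldots, x_n) \in \mathcal{X}^{(R)}$ we have $\bar{h}(\bar{\mathbf{x}}) = \psi(h(x_1), \ldots, h(x_n))$. Since $\psi$ is $(a, b, c, d)$-bounded, it follows that

$$a \sum_{x \in \bar{\mathbf{x}}} h(x) + b \leq \bar{h}(\bar{\mathbf{x}}) \leq c \sum_{x \in \bar{\mathbf{x}}} h(x) + d. \qquad (31)$$

In addition, since $a$ and $c$ are positive we also have

$$(\bar{h}(\bar{\mathbf{x}}) - d)/c \leq \sum_{x \in \bar{\mathbf{x}}} h(x) \leq (\bar{h}(\bar{\mathbf{x}}) - b)/a. \qquad (32)$$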

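Assume the input bag sample is $S = \{(w_i, \bar{\mathbf{x}}_i, y_i)\}_{i \in [m]}$. Denote $I_+ = \{i \in [m] \mid y_i = +1\}$ and $I_- = \{i \in [m] \mid y_i = -1\}$. Let $h \in \mathcal{H}$ be a hypothesis. We have

$$\Gamma(\bar{h}, S) = \sum_{i \in I_+} w_i \bar{h}(\bar{\mathbf{x}}_i) - \sum_{i \in I_-} w_i \bar{h}(\bar{\mathbf{x}}_i)$$

$$\geq \sum_{i \in I_+} w_i \Big(a \sum_{x \in \bar{\mathbf{x}}_i} h(x) + b\Big) - \sum_{i \in I_-} w_i \Big(c \sum_{x \in \bar{\mathbf{x}}_i} h(x) + d\Big) \qquad (33)$$

$$= \sum_{i \in I_+} w_i a \sum_{x \in \bar{\mathbf{x}}_i} h(x) - \sum_{i \in I_-} w_i c \sum_{x \in \bar{\mathbf{x}}_i} h(x) + \sum_{i \in I_+} w_i b - \sum_{i \in I_-} w_i d. \qquad (34)$$

Line (33) follows from Eq. (31). As evident by steps (1,2) of MILearn$^{\mathcal{A}}$, in the sample $S_I$ every instance from a positive bag $i$ has weight $\alpha(+1) w_i = a w_i$, and every instance from a negative bag $i$ has weight $\alpha(-1) w_i = c w_i$. Therefore

$$\Gamma(h, S_I) = \sum_{i \in [m]} \sum_{x \in \bar{\mathbf{x}}_i} w_i y_i \alpha(y_i) h(x) = \sum_{i \in I_+} w_i a \sum_{x \in \bar{\mathbf{x}}_i} h(x) - \sum_{i \in I_-} w_i c \sum_{x \in \bar{\mathbf{x}}_i} h(x).$$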
i£[m]a;GXi iG/+ 2;SXi iG/- 2;GXi 
Combining this equality with Eq. (34) we get

$$\Gamma(\bar{h}, S) \geq \Gamma(h, S_I) + \sum_{i \in I_+} w_i b - \sum_{i \in I_-} w_i d.$$

Since $\sum_{i \in I_+} w_i = W_+$ and $\sum_{i \in I_-} w_i = W_- = 1 - W_+$, it follows that

$$\Gamma(\bar{h}, S) \geq \Gamma(h, S_I) + bW_+ - dW_- = \Gamma(h, S_I) + (b + d)W_+ - d. \qquad (35)$$

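Now, for any hypothesis $h$ we can conclude from Eq. (32) that

$$\Gamma(h, S_I) = \sum_{i \in I_+} a w_i \sum_{x \in \bar{\mathbf{x}}_i} h(x) - \sum_{i \in I_-} c w_i \sum_{x \in \bar{\mathbf{x}}_i} h(x)$$

$$\geq \sum_{i \in I_+} a w_i (\bar{h}(\bar{\mathbf{x}}_i) - d)/c - \sum_{i \in I_-} c w_i (\bar{h}(\bar{\mathbf{x}}_i) - b)/a$$

$$= \sum_{i \in I_+} \frac{a}{c} w_i \bar{h}(\bar{\mathbf{x}}_i) - \sum_{i \in I_-} \frac{c}{a} w_i \bar{h}(\bar{\mathbf{x}}_i) - \sum_{i \in I_+} a d w_i / c + \sum_{i \in I_-} c b w_i / a$$

$$= \frac{c}{a} \Gamma(\bar{h}, S) + \Big(\frac{a}{c} - \frac{c}{a}\Big) \sum_{i \in I_+} w_i \bar{h}(\bar{\mathbf{x}}_i) - \frac{ad}{c} W_+ + \frac{cb}{a} W_-$$

$$= \frac{c}{a} \Gamma(\bar{h}, S) + \Big(\frac{a}{c} - \frac{c}{a}\Big) \sum_{i \in I_+} w_i \bar{h}(\bar{\mathbf{x}}_i) - \Big(\frac{ad}{c} + \frac{cb}{a}\Big) W_+ + \frac{cb}{a}.$$

In the last equality we used the fact that $W_- = 1 - W_+$. Since $Z = \frac{c}{a}$, it follows that

$$\Gamma(h, S_I) \geq Z\Gamma(\bar{h}, S) + \Big(\frac{1}{Z} - Z\Big) \sum_{i \in I_+} w_i \bar{h}(\bar{\mathbf{x}}_i) - \Big(\frac{d}{Z} + Zb\Big) W_+ + Zb. \qquad (36)$$

We will now lower-bound the right-hand side of Eq. (36). Note that $\frac{1}{Z} - Z \leq 0$ since $c \geq a$. Therefore we need an upper bound for $\sum_{i \in I_+} w_i \bar{h}(\bar{\mathbf{x}}_i)$. We consider each of the two cases in the statement of the lemma separately.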

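Case 1: $\mathcal{A}$ is $\epsilon$-optimal We have $\sum_{i \in I_+} w_i \bar{h}(\bar{\mathbf{x}}_i) \leq \sum_{i \in I_+} w_i = W_+$. Therefore, by Eq. (36) for any $h \in \mathcal{H}$

$$\Gamma(h, S_I) \geq Z\Gamma(\bar{h}, S) + \Big(\frac{1}{Z} - Z - \frac{d}{Z} - Zb\Big) W_+ + Zb. \qquad (37)$$

For a natural $n$, set $h^n \in \mathcal{H}$ such that $\Gamma(\bar{h}^n, S) \geq \gamma^* - \frac{1}{n}$. We have (see explanations below)

$$\Gamma(\bar{h}_I, S) \geq \Gamma(h_I, S_I) + (b + d)W_+ - d \qquad (38)$$

$$\geq \Gamma(h^n, S_I) + (b + d)W_+ - d - \epsilon W \qquad (39)$$

$$\geq Z\Gamma(\bar{h}^n, S) + \Big(\frac{1}{Z} - Z - \frac{d}{Z} - Zb\Big) W_+ + Zb + (b + d)W_+ - d - \epsilon W \qquad (40)$$

$$= Z\Gamma(\bar{h}^n, S) + \Big(\frac{1}{Z} - Z + \Big(1 - \frac{1}{Z}\Big)(d - Zb)\Big) W_+ + Zb - d - \epsilon W$$

$$\geq Z\Big(\gamma^* - \frac{1}{n}\Big) + \Big(\frac{1}{Z} - Z + \Big(1 - \frac{1}{Z}\Big)(d - Zb)\Big) W_+ + Zb - d - \epsilon W.$$

Eq. (38) is a restatement of Eq. (35). Eq. (39) follows from the $\epsilon$-optimality of $\mathcal{A}$. Eq. (40) follows from Eq. (37). By taking $n \to \infty$, this inequality proves case (1) of the lemma.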
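Case 2: $\mathcal{A}$ is one-sided-$\epsilon$-optimal We have $\sum_{i \in I_+} w_i \bar{h}(\bar{\mathbf{x}}_i) \leq \sum_{i \in I_+} w_i = W_+$. Let $\bar{h} \in \Omega(\overline{\mathcal{H}}, S)$. Then for all $i \in I_-$, $\bar{h}(\bar{\mathbf{x}}_i) = -1$. Therefore

$$\Gamma(\bar{h}, S) = \sum_{i \in I_+} w_i \bar{h}(\bar{\mathbf{x}}_i) - \sum_{i \in I_-} w_i \bar{h}(\bar{\mathbf{x}}_i) = \sum_{i \in I_+} w_i \bar{h}(\bar{\mathbf{x}}_i) + W_-.$$

Therefore $\sum_{i \in I_+} w_i \bar{h}(\bar{\mathbf{x}}_i) = \Gamma(\bar{h}, S) - W_- = \Gamma(\bar{h}, S) + W_+ - 1$. Combining this with Eq. (36) we get

$$\Gamma(h, S_I) \geq Z\Gamma(\bar{h}, S) + \Big(\frac{1}{Z} - Z\Big) \sum_{i \in I_+} w_i \bar{h}(\bar{\mathbf{x}}_i) - \Big(\frac{d}{Z} + Zb\Big) W_+ + Zb$$

$$= Z\Gamma(\bar{h}, S) + \Big(\frac{1}{Z} - Z\Big)(\Gamma(\bar{h}, S) + W_+ - 1) - \Big(\frac{d}{Z} + Zb\Big) W_+ + Zb$$

$$= \frac{1}{Z}\Gamma(\bar{h}, S) + \Big(\frac{1}{Z} - Z - \frac{d}{Z} - Zb\Big) W_+ + Zb - \frac{1}{Z} + Z. \qquad (41)$$

For a natural $n$, set $\bar{h}^n \in \Omega(\overline{\mathcal{H}}, S)$ such that $\Gamma(\bar{h}^n, S) \geq \gamma^*_+ - \frac{1}{n}$. For all bags $i \in I_-$, $\bar{h}^n(\bar{\mathbf{x}}_i) = -1$. Thus $\psi(h^n(x_i[1]), \ldots, h^n(x_i[|\bar{\mathbf{x}}_i|])) = -1$. By the assumption on $\psi$ in case (2) of the lemma, this implies that for all $i \in I_-$, $j \in [|\bar{\mathbf{x}}_i|]$, $h^n(x_i[j]) = -1$. Therefore $h^n \in \Omega(\mathcal{H}, S_I)$. We have (see explanations below)

$$\Gamma(\bar{h}_I, S) \geq \Gamma(h_I, S_I) + (b + d)W_+ - d \qquad (42)$$

$$\geq \Gamma(h^n, S_I) + (b + d)W_+ - d - \epsilon W \qquad (43)$$

$$\geq \frac{1}{Z}\Gamma(\bar{h}^n, S) + \Big(\frac{1}{Z} - Z - \frac{d}{Z} - Zb\Big) W_+ + Zb - \frac{1}{Z} + Z + (b + d)W_+ - d - \epsilon W \qquad (44)$$

$$= \frac{1}{Z}\Gamma(\bar{h}^n, S) + \Big(\frac{1}{Z} - Z + \Big(1 - \frac{1}{Z}\Big)(d - Zb)\Big) W_+ + Zb - d + Z - \frac{1}{Z} - \epsilon W$$

$$\geq \frac{1}{Z}\Big(\gamma^*_+ - \frac{1}{n}\Big) + \Big(\frac{1}{Z} - Z + \Big(1 - \frac{1}{Z}\Big)(d - Zb)\Big) W_+ + Zb - d + Z - \frac{1}{Z} - \epsilon W.$$

Eq. (42) is a restatement of Eq. (35). Eq. (43) follows from the one-sided-$\epsilon$-optimality of $\mathcal{A}$ and the fact that $h^n \in \Omega(\mathcal{H}, S_I)$. Eq. (44) follows from Eq. (41). By considering $n \to \infty$, this proves the second part of the lemma. ∎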

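Proof [of Theorem 27] MILearn$^{\mathcal{A}}$ selects the hypothesis with the best edge on $S$ between $\bar{h}_I$ and $h_{\mathrm{pos}}$. Therefore

$$\Gamma(h_o, S) = \max(\Gamma(h_{\mathrm{pos}}, S), \Gamma(\bar{h}_I, S)).$$

We have

$$\Gamma(h_{\mathrm{pos}}, S) = \sum_{i \in [m]} w_i y_i h_{\mathrm{pos}}(\bar{\mathbf{x}}_i) = \sum_{i \in [m]} w_i y_i = W_+ - W_- = 2W_+ - 1.$$

Thus

$$\Gamma(h_o, S) = \max(2W_+ - 1, \Gamma(\bar{h}_I, S)). \qquad (45)$$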
We now lower-bound $\Gamma(h_o, S)$ by bounding $\Gamma(\bar{h}_I, S)$ separately for the two cases of the theorem. Let $W$ be the total weight of $S_I$. Since $R \subseteq [r]$, $a \leq c$, and $\sum_{i \in [m]} w_i = 1$, we have

$$W = \sum_{i \in I_+} \sum_{x \in \bar{\mathbf{x}}_i} a w_i + \sum_{i \in I_-} \sum_{x \in \bar{\mathbf{x}}_i} c w_i \leq rc \sum_{i \in [m]} w_i = rc. \qquad (46)$$

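Case 1: $\mathcal{A}$ is $\epsilon$-optimal From Lemma 32 and Eq. (46) we have

$$\Gamma(\bar{h}_I, S) \geq Z\gamma^* + \Big(\frac{1}{Z} - Z + \Big(1 - \frac{1}{Z}\Big)(d - Zb)\Big) W_+ + Zb - d - rc\epsilon$$

$$= Z\gamma^* + \Big(\frac{1}{Z} - Z + \Big(1 - \frac{1}{Z}\Big)(Z - 1 + \eta)\Big) W_+ - (Z - 1 + \eta) - rc\epsilon$$

$$= Z\gamma^* + (\eta - 2)\Big(1 - \frac{1}{Z}\Big) W_+ + 1 - \eta - Z - rc\epsilon.$$

The second line follows from the assumption $d - Zb - Z + 1 = \eta$. Combining this with Eq. (45) we get

$$\Gamma(h_o, S) \geq \max\Big\{2W_+ - 1,\; Z\gamma^* + (\eta - 2)\Big(1 - \frac{1}{Z}\Big) W_+ + 1 - \eta - Z - rc\epsilon\Big\}.$$

The right-hand side is minimal when the two expressions in the maximum are equal. This occurs when

$$W_+ = W_0 \triangleq \frac{Z\gamma^* + 2 - \eta - Z - rc\epsilon}{2 + (2 - \eta)\big(1 - \frac{1}{Z}\big)}.$$

Therefore, for any value of $W_+$

$$\Gamma(h_o, S) \geq 2W_0 - 1 = \frac{Z\gamma^* - Z + \frac{1}{Z} - \frac{\eta}{2}\big(1 + \frac{1}{Z}\big) - rc\epsilon}{1 + \big(1 - \frac{\eta}{2}\big)\big(1 - \frac{1}{Z}\big)}.$$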

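Case 2: $\mathcal{A}$ is one-sided-$\epsilon$-optimal From Lemma 32 and Eq. (46) we have

$$\Gamma(\bar{h}_I, S) \geq \frac{1}{Z}\gamma^*_+ + \Big(\frac{1}{Z} - Z + \Big(1 - \frac{1}{Z}\Big)(d - Zb)\Big) W_+ + Zb - d + Z - \frac{1}{Z} - rc\epsilon$$

$$= \frac{1}{Z}\gamma^*_+ + \Big(\frac{1}{Z} - Z + \Big(1 - \frac{1}{Z}\Big)(Z - 1 + \eta)\Big) W_+ - (Z - 1 + \eta) + Z - \frac{1}{Z} - rc\epsilon$$

$$= \frac{1}{Z}\gamma^*_+ + (\eta - 2)\Big(1 - \frac{1}{Z}\Big) W_+ + 1 - \eta - \frac{1}{Z} - rc\epsilon.$$

The second line follows from the assumption $d - Zb = Z - 1 + \eta$. Combining this with Eq. (45) we get

$$\Gamma(h_o, S) \geq \max\Big\{2W_+ - 1,\; \frac{1}{Z}\gamma^*_+ + (\eta - 2)\Big(1 - \frac{1}{Z}\Big) W_+ + 1 - \eta - \frac{1}{Z} - rc\epsilon\Big\}.$$

The right-hand side is minimal when the two expressions in the maximum are equal. This occurs when

$$W_+ = W_0 \triangleq \frac{\gamma^*_+ - 1 + (2 - \eta - rc\epsilon)Z}{2Z + (2 - \eta)(Z - 1)}.$$

Substituting $W_0$ for $W_+$ in the lower bound, we get

$$\Gamma(h_o, S) \geq 2W_0 - 1 = \frac{\gamma^*_+ - \frac{\eta}{2}(Z + 1) - rc\epsilon Z}{2Z - 1 - \frac{\eta}{2}(Z - 1)}. \qquad ∎$$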
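The last step of Case 2 can be sanity-checked numerically: the lower bound is the maximum of an increasing and a decreasing linear function of $W_+$, so its minimum over $W_+ \in [0, 1]$ sits at their crossing point and should match the closed form. The following Python sketch performs this check on a grid; the function names and the parameter values ($Z = 3$, $\eta = 0$, $rc\epsilon = 1/6$, $\gamma^*_+ = 1$) are illustrative assumptions.

```python
def case2_lower_bound(w_plus: float, gamma_plus: float, Z: float,
                      eta: float, rc_eps: float) -> float:
    """max{2 W_+ - 1, bound on the edge of the induced bag hypothesis} for a given W_+."""
    induced = (gamma_plus / Z + (eta - 2.0) * (1.0 - 1.0 / Z) * w_plus
               + 1.0 - eta - 1.0 / Z - rc_eps)
    return max(2.0 * w_plus - 1.0, induced)


def case2_closed_form(gamma_plus: float, Z: float, eta: float, rc_eps: float) -> float:
    """Closed-form value of 2 W_0 - 1 from the end of Case 2."""
    return ((gamma_plus - (eta / 2.0) * (Z + 1.0) - rc_eps * Z)
            / (2.0 * Z - 1.0 - (eta / 2.0) * (Z - 1.0)))


params = dict(gamma_plus=1.0, Z=3.0, eta=0.0, rc_eps=1.0 / 6.0)
grid_min = min(case2_lower_bound(i / 1000.0, **params) for i in range(1001))
closed = case2_closed_form(**params)
```

For these parameters the crossing point is $W_0 = 0.55$, and both the grid minimum and the closed form evaluate to approximately $0.1$.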